X tweet cycle digest: 24 picks from 102 rounds of AI research updates

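Core assessment

Recent AI research activity clusters around four threads: ① Physical AI / omnimodal world models (NVIDIA Cosmos 3 unifies images, video, audio, and actions in a single generator); ② multimodal agents (Step 3.7 Flash, an MoE vision-language model built around agent efficiency, alongside Qwen3.7-Plus and MiniMax M3); ③ agentic RL and reward (reward design for nested multi-agent subagent calls, harness-as-product, and the ClawEval leaderboard cited by StepFun); ④ speech LLMs (OpenAI real-time translation, the AA-WER Streaming benchmark, and Tencent's Universal Audio Tokenizer).

Taken together, these tweets read less like isolated news items and more like an industry shift in how research is executed: from "bigger models are better" toward an engineering consensus that evaluation drives progress and the harness sets the capability ceiling.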

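Sampling mechanism: 102 rounds at a 15-minute cadence

This sample comes from a continuously running polling pipeline. It expands 10 topics into OR variants and deduplicates by tweet ID across rounds, so that a single high-engagement tweet is not amplified by being counted repeatedly.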

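Retrieval: topic queries are sorted by popularity, each round caps the number of results returned, and the structured results are kept for later review.
The topic pool has 10 topics, each widened with OR variants (for example, audio language model OR audio LLM OR speech LLM).
The time window is the most recent 7 days. Every round pulls the full 7-day window, so the overlap between rounds is removed by the cross-round ID table.
A round fires every 15 minutes, and its results are written to a structured per-round record.
The cross-round dedup table is persisted, which ensures that tweets already observed are not written again.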
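
As a rough illustration of the mechanism described above, the following Python sketch expands a topic into an OR query and filters each round against a cross-round ID table. Everything named here (`expand_query`, `SeenStore`, `run_round`, the `search` callable, and the `id` field) is hypothetical; the pipeline's actual code is not shown in this note, and a real store would be persisted to disk rather than held in memory.

```python
from collections.abc import Callable
from dataclasses import dataclass, field


def expand_query(variants: list[str]) -> str:
    """Join keyword variants into a single OR query string."""
    return " OR ".join(variants)


@dataclass
class SeenStore:
    """Cross-round table of tweet IDs already observed (in memory here)."""

    # Maps tweet ID -> the topic that surfaced it first.
    first_topic: dict[str, str] = field(default_factory=dict)

    def filter_new(self, topic: str, tweets: list[dict]) -> list[dict]:
        """Return only unseen tweets and record their first-hit topic."""
        fresh = []
        for tweet in tweets:
            tweet_id = tweet["id"]
            if tweet_id in self.first_topic:
                continue  # already seen in this or an earlier round
            self.first_topic[tweet_id] = topic
            fresh.append(tweet)
        return fresh


def run_round(
    store: SeenStore,
    topics: dict[str, list[str]],
    search: Callable[[str], list[dict]],
) -> dict[str, list[dict]]:
    """One polling round: query every topic and keep only unseen tweets."""
    record = {}
    for topic, variants in topics.items():
        results = search(expand_query(variants))
        record[topic] = store.filter_new(topic, results)
    return record
```

Because the first query to surface a tweet claims it, the topic label in this sketch depends on query order, which matches the first-hit effect noted later in this digest.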

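This note reports the deduplicated picks and the topic distribution after these 102 rounds; the statistics and assessments below all derive from this set of 241 unique tweets.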

Topic distribution

Topic query	Unique tweets	Avg ❤	Max ❤
`reward model RLHF agent OR RLHF agent OR process reward`	22	12	104
`multimodal foundation model OR omni model OR omnimodal`	22	10	73
`agent harness eval OR eval harness OR agent eval`	21	30	299
`audio reasoning model OR audio reasoning OR speech reas`	20	13	173
`audio language model OR audio LLM OR speech LLM`	18	31	222
`agentic RL training OR RL agent training OR RL post-tra`	18	164	1093
`multi-modal LLM agent OR multimodal agent OR VLM agent`	18	219	1501
`RL agent harness OR agent harness OR harness eval`	18	82	741
`agentic reward model OR agentic RL OR agent reward`	15	62	539
`multimodal reasoning agent OR multimodal reasoning OR v`	15	11	78
`multimodal foundation model`	13	36	204
`audio language model`	12	10	76
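
The per-query columns above can be reproduced from raw records with a small aggregation. The sketch below assumes each record is a dict with hypothetical `topic`, `id`, and `likes` fields; the real schema of the stored tweets is not documented in this note.

```python
from collections import defaultdict
from statistics import mean


def topic_stats(tweets: list[dict]) -> list[tuple[str, int, int, int]]:
    """Return (topic, unique tweets, rounded mean likes, max likes) rows."""
    likes_by_topic: dict[str, dict[str, int]] = defaultdict(dict)
    for tweet in tweets:
        # Keying by tweet ID drops repeated observations of the same tweet.
        likes_by_topic[tweet["topic"]][tweet["id"]] = tweet["likes"]

    rows = []
    for topic, likes_by_id in likes_by_topic.items():
        likes = list(likes_by_id.values())
        rows.append((topic, len(likes), round(mean(likes)), max(likes)))

    # Largest topics first, as in the table above.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

A mean over fewer than two dozen tweets is sensitive to a single viral post, so the maximum is worth reading alongside the average.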

Prolific authors and high-engagement tweets

Prolific authors

Author	Tweets
@ArxivSound	9
@AIDailyGems	7
@Chinazhidx	5
@HuggingPapers	4
@tldr_ai_papers	3
@Alibaba_Qwen	3
@SciFi	3
@AINativeF	3
@Gsdata5566	3
@gavinz0228	2
@badlogicgames	2
@hsu_steve	2

Top 10 by engagement

❤	Author	Summary	Link
❤1501	@StepFun_ai	⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency. #1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2...	→
❤1494	@NousResearch	Step 3.7 Flash is now free for 30 days via Nous Portal It is a new MoE vision-language model focused on agent efficienc...	→
❤1093	@ClementDelangue	Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the t...	→
❤896	@billxbf	Excited to release 🌟Polar🌟, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Her...	→
❤741	@nickbaumann_	Great read -- all it really takes is: - a harness - connectors to your data/tools - reliable, always-accessible agent(s...	→
❤539	@0xbeepit	The next generation of traders won’t trade manually, they’ll deploy agents. Beep is partnering with @BitgetWallet to on...	→
❤455	@adithya_s_k	Introducing Repo2RLEnv Turn any repository into runnable, verifiable coding environments built from real PRs and commit...	→
❤430	@Prince_Canuma	Today we're shipping our biggest MLX-VLM release yet: v0.6.0 ...and we are raising 💸 This one's about turning your App...	→
❤349	@badlogicgames	pibot is now running fully local, using parakeet for STT, qwen3-tts for TTS, and Qwen 3.6 as the local multi-modal LLM v...	→
❤299	@novasarc01	i’m increasingly convinced that the best agent evals will come from mining real agent failure traces. my view is that ev...	→

Topic: audio language model

This topic covers 30 unique tweets; selected high-engagement picks follow.

🖼 @mli0603 · ❤76 👁82512 · Mon Jun 01 07:26

This is THE moment of Physical AI! We are officially announcing Cosmos 3: Omnimodal World Models for Physical AI 🚀 - Cosmos 3 is an omnimodal world model: within a unified architecture, it can understand and generate language, images, video, audio, and actions. - It is not just a VLM, not just a video generator, not just an audio-visual generative model, and not just a physics simulator / world-action model. It can understand images and videos, generate images, videos, and audio, simulate future worlds, predict actions, and generate robot policies—enabling models to truly begin to “touch the world.” - Cosmos 3 is the #1 open-weight reasoner / T2I / I2V / robot policy across many benchmarks. Huge thanks to every teammate who fought side by side on this journey—from architecture, data, training, infra, serving, and evaluation to post-training. Every part of this project carries an incredible amount of hard work. This was my first time leading a project as Tech Lead, and I feel truly fortunate. The future of Physical AI needs models that can not only “see” and “describe” the world, but also “imagine,” “simulate,” and “act”—and eventually close the loop with the real world. I hope Cosmos 3 can become an important starting point for this direction, and I’m excited to push Physical AI into its next stage together with the open-source community. Welcome to the era of Physical AI. HuggingFace: https://t.co/QW5h5pIWWM Project Website: https://t.co/Jppa0gkn16 Code: https://t.co/aJgaLm5BaG

🖼 @oprydai · ❤19 👁1494 · Mon Jun 01 04:45

Cosmos 3 is here. this is not just another AI model release. this is NVIDIA moving deeper into Physical AI. why it matters: • it combines language, images, video, audio, and actions • it can reason about the physical world • it can generate worlds, not just pixels • it supports robot action generation • it connects simulation, robotics, and embodied AI the big shift: AI is moving from “predict the next token” to “predict the next state of the world.” that matters for robots. because robots don’t live inside text boxes. they live inside messy rooms, factories, roads, kitchens, warehouses, hospitals, and human environments. Cosmos 3 is basically infrastructure for this: • text → image • image → video • video → world • world → action • action → policy • policy → robot behavior this is where robotics starts getting interesting. not because we suddenly solved embodiment. but because the stack is forming: • world models • simulation • synthetic data • robot policies • multimodal reasoning • physical interaction the future of AI is not only chatbots. it is machines that can see, predict, simulate, and act. Physical AI is becoming real.

🖼 @caydengineer · ❤222 👁99356 · Fri May 29 18:22

OpenAI just dropped a completely new kind of model gpt-realtime-translate takes in speech audio from any language and outputs speech in your target language LLMs are great, but you need specialized models for specialized use cases We're running this on our smart glasses https://t.co/uJnGdL5DlE

🖼 @ArtificialAnlys · ❤146 👁12506 · Thu May 28 15:34

Announcing AA-WER Streaming, our new benchmark measuring streaming Speech to Text models on accuracy and latency for voice agent use cases. Pareto optimal models on this new benchmark include those from Cartesia, ElevenLabs, and Deepgram Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts are especially important for keeping responses feeling natural and leaves more of the response-time budget for reasoning and tool calls. Accuracy also matters since transcription errors compound in downstream reasoning and speech generation. Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower. What we measure: AA-WER Streaming reports Word Error Rate and latency together, measured from the moment end of speech is detected, with a Pareto line of increasing accuracy as time to transcript received increases. For direct comparability to offline models on accuracy, we test these streaming models on the same ~8 hours of audio as our offline benchmark, AA-WER v2.0: AA-AgentTalk, Earnings22-Cleaned-AA, VoxPopuli-Cleaned-AA. We measure WER and latency as paired metrics at two points after Silero VAD-detected end of speech: First Final Transcription: WER is measured on the first final-denoted transcript returned after end of speech is detected. Latency is the time in seconds from end of speech to that final-denoted transcript. This is more useful for understanding performance as a standalone streaming transcription model, and for higher accuracy. First Partial Transcription: WER is measured on the first transcript-bearing event (partial or final) returned after end of speech is detected. Latency is the time in seconds from end of speech to that first transcript event. 
This is more useful for near instantaneous transcription for lower-accuracy tasks like responding to "yes" or "no" questions, or for speculative decoding. Key results: ➤ Highest accuracy on Final after End of Speech: @Cartesia Ink-2 (semantic endpoints) at 3.59% WER, 0.21s latency, followed by ElevenLabs Scribe v2 Realtime (3.64%, 0.14s) and Cartesia Ink-2 (external endpoints) (3.66%, 0.09s) ➤ Highest accuracy on First Partial after End of Speech: @ElevenLabs Scribe v2 Realtime at 3.65% WER, 0.13s latency, followed by Cartesia Ink-2 (external endpoints) (4.33%, 0.07s) and @AssemblyAI U3 Realtime Pro (4.46%, 0.47s) ➤ Fastest transcription: @DeepgramAI Flux leads both Final and Partial at 0.020s and 0.019s respectively (both 7.36% WER). On Final, it's followed by @soniox_ai Realtime and Deepgram Nova-3 Realtime (both 0.06s); on First Partial, it’s followed by @NVIDIA Nemotron 3 ASR 80ms (0.04s) and Soniox Realtime (0.05s) Charts below include a Pareto frontier of accuracy vs. speed, so you can shortlist the models that best fit your latency constraints while still achieving high accuracy. See below for further detail ⬇️
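
For readers unfamiliar with the metric in the tweet above, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a minimal version that assumes plain whitespace tokenization and no text normalization; the tweet does not describe the benchmark's own normalization, so this is not a reproduction of AA-WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "yes" against the reference "yes please" counts one deletion over two reference words, a WER of 0.5. A streaming benchmark of the kind described would pair each such score with a latency measurement.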

Topic: multimodal foundation model

This topic covers 35 unique tweets; selected high-engagement picks follow.

🖼 @lagerskoy · ❤73 👁9920 · Fri May 29 03:36

GOOGLE JUST SHIPPED GEMINI OMNI AT I/O 2026 LAST WEEK AND THE ENTIRE VIDEO EDITING INDUSTRY HAS 12 MONTHS BEFORE ITS BUSINESS MODEL COLLAPSES. THIS IS NOT FUTURE TECH. THIS IS LIVE TODAY. CONVERSATIONAL VIDEO EDITING THROUGH NATURAL LANGUAGE WITH PHYSICS-AWARE COMPOSITING. MOST PEOPLE WILL NOT REALIZE WHAT JUST HAPPENED FOR 6 MORE MONTHS. Here's what Omni actually does. You upload one video. You describe a change in plain English. The model executes the change at scene-aware fidelity. Add a lion to the floor that respects perspective and lighting. Make the curtains open while preserving the rest of the room. Change a plastic bottle to metal. Fill an empty bottle with water. Make lights flicker on a snap of your fingers. The output is photorealistic. The geometry is correct. The lighting matches the scene. The physics are coherent. This is not Veo 3 generating new video from text. This is Omni editing existing video through conversation. The distinction matters more than the tech press is processing. Now here's the math nobody is doing. A traditional VFX artist charges $150 to $400 per hour to composite a single complex element into existing footage. The work that goes into adding a CGI lion to a hotel room shot with correct lighting and perspective takes 20 to 40 hours of skilled labor. Total cost runs $3,000 to $16,000 per shot. This pipeline executes the same shot in 90 seconds through natural language for the cost of a Google AI Premium subscription. The freelance VFX community on Reddit is already arguing about whether this is real or marketing. The answer is both. The demos are real. The output quality is comparable to mid-tier post-production work. The marketing is also real because Google is positioning this as the consumer-facing video creation surface of the next decade. This is the seventh creative compression of 2026. Claude Design collapsed the design layer. KIMI K2.6 collapsed the coding layer. Rodin Gen 2.5 collapsed the 3D asset layer. 
21st dev plus Claude Code collapsed the landing page production layer. The Vision API plus Gemini plus Sora 2 stack collapsed the UGC video production layer. Section Store with AI Conversion Blocks collapsed the e-commerce CRO layer. Gemini Omni just collapsed the video post-production layer. Seven creative industries that defined entire freelance economies 18 months ago are now subscription stacks running under $500 a month combined. The part most builders will miss. The opportunity is not making videos for yourself. The opportunity is what happens when every solo creator has access to mid-tier VFX studio capability for $20 a month. A real estate agent can now produce property tour videos with cinematic lighting adjustments without hiring a film crew. A small business owner can produce product demonstration videos with photorealistic CGI element additions without paying a VFX studio. A teacher can produce educational content with complex visual demonstrations without learning After Effects. The market floor just got lifted by 4 orders of magnitude. Here's what is actually happening across the entire creative stack in 2026. Every layer of professional creative production that required specialized expertise 18 months ago is collapsing into conversational AI interfaces this year. Design. Code. 3D assets. Landing pages. UGC video. E-commerce optimization. Now video post-production. The freelance economy built on selling these specialized skills is being repriced in real time. The clients who paid $5,000 for a UGC video. The brands who paid $15,000 for a landing page. The agencies who charged $25,000 for a product page redesign. The VFX shops who billed $10,000 per CGI shot. All of them are watching their margins compress while their clients ask why the AI version is not good enough. The smart freelancers pivot from execution to strategy. The ones who try to compete on execution against AI lose every contract within 18 months. 
The window for solo operators to capture the next wave of creative production is open right now. Open Gemini. Upload one video. Describe one change in plain English. Watch what the next decade of creative production looks like. Then decide whether you are positioning to ship at this leverage or watching from the sidelines while others build the next category of creative business with tools that did not exist 30 days ago.

· @KaichunMo · ❤46 👁4830 · Mon Jun 01 05:28

Super proud to be part of NVIDIA's Cosmos3 Physical-AI Omnimodal Foundation Model. Topping the image/video/sound generation benchmarks and Robot Policy Benchmarks 😀

🖼 @intheworldofai · ❤204 👁13877 · Mon Jun 01 01:14

🚨 MiniMax M3 is now available! MiniMax has officially launched MiniMax M3, a new multimodal foundation model with: • 1M token context window 🤯 • Text, image & video inputs • Strong coding + tool use capabilities • Built for long-horizon agentic workflows • Powered by MiniMax Sparse architecture The race for ultra-long context AI models is heating up fast. 👀

🖼 @Alibaba_Qwen · ❤146 👁2834 · Mon Jun 01 17:54

👏👏 Introducing Qwen3.7-Plus — a multimodal agent model that unifies vision and language into one versatile agent foundation. ✅ Multimodal interactive hybrid agent: unified GUI & CLI operation across visual and text tasks ✅ Versatile coding agent & productivity assistant with full-modality input ✅ Visual Agent: perception, reasoning, grounding, and search-augmented QA ✅ Cross-harness generalization across diverse agent frameworks One model. Sees, thinks, codes, acts.🙌🙌 Now available via API on Alibaba Cloud Model Studio. Try it — let us know what you build.😎 🔗🔗⬇️⬇️ Blog：https://t.co/pVYf0h3NNa Qwen Studio：https://t.co/HUYgFW4cYf API：https://t.co/viL0cXrMzW

Topic: multimodal reasoning agent

This topic covers 27 unique tweets; selected high-engagement picks follow.

🖼 @NikhilLamba6 · ❤12 👁1198 · Sun May 31 19:10

Hi all, I thought of a multimodal project couple of months back. Where the agent should be able to follow the env of agentic harness and mimic how computer use agents work. Should utilize: memory, ReAct reasoning loop, tool, context, etc. Constraint: no use of Langchain, CrewAI, AutoGen, DeepAgents or any of these utility libraries. Just raw next token prediction and python. It's always good to use skills, but we should be knowing under the hood stuff as well, right? I have used a screen parsing tool for this: MSFTResearch (huge kudos) OmniParser so that model knows what it is looking into and could decide the clicks and scrolls. Wanna take a look? Github: https://t.co/a8LGDKOwMO Have attached a video and loom link below: https://t.co/ULSV3uq7GA Will be glad to hear your thoughts. #DeepLearning #agenticAI #machinelearning

🖼 @lobehub · ❤11 👁874 · Mon Jun 01 06:24

MiniMax M3 is now live on LobeHub 🚀 We’ve been testing it internally, and the early results look seriously promising💥💥💥 🧠 Strong reasoning ⚡️ Fast responses 👀 Multimodal inputs 🤖 Built for coding & agent workflows 📚 1M context window Full benchmarks and comparisons are coming soon 📊 You can try M3 on LobeHub today：）

· @HuggingPapers · ❤78 👁6713 · Fri May 29 21:17

Multimodal inputs: text, image, and video Up to 262K context Ready for vLLM on Hopper and Blackwell https://t.co/i1rFPjyufZ

· @altiamkabir · ❤41 👁7222 · Wed May 27 18:43

DeepSeek got everyone talking about reasoning. SenseNova U1 is pushing another frontier: unified multimodal generation + understanding in open source. The full training codebase of SenseNova U1 is open sourced now too, which makes this release even more interesting. Worth an upvote. https://t.co/SlmtdlbbRf

Topic: audio reasoning model

This topic covers 25 unique tweets; selected high-engagement picks follow.

🖼 @lmsysorg · ❤17 👁1098 · Mon Jun 01 17:40

Cosmos 3 is now supported in SGLang-Diffusion. Cosmos 3 is NVIDIA’s open world model family for Physical AI, combining vision reasoning, world generation, and action-oriented multimodal modeling across text, images, video, audio, and actions. Serve NVIDIA Cosmos3 generator models (Cosmos3-Nano, Cosmos3-Super, and specialized Super checkpoints) with native SGLang runtime and OpenAI-compatible APIs:

🖼 @HuggingPapers · ❤32 👁2347 · Mon Jun 01 18:01

Tencent just released Universal Audio Tokenizer on Hugging Face A compact single-codebook model that uniquely combines general audio perception and linguistic alignment for seamless Audio-LLM integration. https://t.co/5g7MTwqDIT

· @Trtd6Trtd · ❤30 👁2501 · Sat May 30 00:00

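https://t.co/Jj6UA37UYp A speech-language foundation model that unifies three tasks (ASR, TTS, and real-time spoken dialogue) in a single architecture. It apparently also builds its own audio tokens and trains speech and text in a shared representation space. The key finding was that, because ASR has external evidence in the form of the audio input, speculative decoding can be used more aggressively than in an LLM's free-form text generation.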

Topic: agentic RL training

This topic covers 22 unique tweets; selected high-engagement picks follow.

· @hsu_steve · ❤33 👁5515 · Sun May 31 01:05

Underappreciated: The impact of high quality data on LLM training. Pretraining data: mostly cleaned and curated by humans, but in some cases by big models using lots of inference tokens. But for agentic post-training, need environments to RL tool use, actions, etc. Not sure about $10-15 billion per lab - only the 3 big US labs could afford that.

🖼 @ClementDelangue · ❤1093 👁1003032 · Fri May 29 01:44

Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error. The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal. The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right. Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL 🤗🔥 https://t.co/zmx0EQl3jM
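
The re-tokenization trap described above can be reproduced with a toy vocabulary. The sketch below uses a made-up three-token vocabulary and a greedy longest-match encoder (both assumptions for illustration, not any real tokenizer) to show that decoding and re-encoding can return different token IDs. It then defines a hypothetical `RolloutBuffer` that keeps sampled IDs untouched and encodes only text the model never sampled, such as tool output.

```python
# Toy vocabulary: "ab" can be spelled as one token or as "a" + "b".
VOCAB = {0: "a", 1: "b", 2: "ab"}
TOKEN_ID = {text: token_id for token_id, text in VOCAB.items()}


def decode(ids: list[int]) -> str:
    """Concatenate the text of each token ID."""
    return "".join(VOCAB[i] for i in ids)


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding over the toy vocabulary."""
    pieces = sorted(TOKEN_ID, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(TOKEN_ID[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot encode text at position {pos}")
    return ids


class RolloutBuffer:
    """Keeps sampled token IDs verbatim; never decodes and re-encodes them."""

    def __init__(self) -> None:
        self.token_ids: list[int] = []
        self.is_sampled: list[bool] = []  # True where a loss may be applied

    def append_sampled(self, ids: list[int]) -> None:
        """Store IDs exactly as the model sampled them."""
        self.token_ids.extend(ids)
        self.is_sampled.extend([True] * len(ids))

    def append_external(self, text: str) -> None:
        """Encode text the model did not sample (e.g. a tool result) once."""
        ids = encode(text)
        self.token_ids.extend(ids)
        self.is_sampled.extend([False] * len(ids))
```

Here a model that sampled `[0, 1]` produces the text "ab", but `encode("ab")` returns `[2]`: a different sequence with a different length, which is the kind of mismatch the tweet warns about. Appending to the buffer sidesteps the round trip entirely.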

🖼 @adithya_s_k · ❤455 👁57648 · Thu May 28 13:32

Introducing Repo2RLEnv Turn any repository into runnable, verifiable coding environments built from real PRs and commits for coding-agent evaluation or RL training > uv pip install repo2rlenv https://t.co/nOHVATWcs6

Topic: reward model RLHF agent

This topic covers 22 unique tweets; selected high-engagement picks follow.

· @HeMuyu0327 · ❤52 👁7060 · Fri May 29 22:03

The multi-agent RL terrain is beautiful to watch, besides Claude's launch of Workflow. For one, Kimi's RL work on "agent swarm" is a masterclass on reward design: three weighted reward terms covering output success, how many subagents there are, and how many of their subtasks are successful. Right now, this type of multi-agent RL is focusing on the scenario of one main agent + several subagents, and we are already now in the age of subagents calling subagents calling subagents in a nested loop. It will be very interesting to see what kind of reward design we can come up with this "recursive" multi-agent or if it's ever needed. Kimi explicitly mentions in the report that since the reward for the subagents is hard to define, they only update the weights of the main agent.
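
To make the three-term structure mentioned in the tweet above concrete, here is a minimal sketch of a weighted reward for the main agent. The weights, the normalization, the cap on subagents, and even the sign of the subagent-count term are assumptions made for illustration; the tweet names the three ingredients but not their form, so this should not be read as Kimi's actual reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwarmRewardWeights:
    """Hypothetical weights; the tweet does not give the real values."""

    outcome: float = 1.0
    subagent_count: float = 0.1
    subtask_success: float = 0.2


def swarm_reward(
    task_succeeded: bool,
    subtask_results: list[bool],
    weights: SwarmRewardWeights = SwarmRewardWeights(),
    max_subagents: int = 8,  # assumed cap used only to normalize the count
) -> float:
    """Weighted sum of outcome, subagent-count, and subtask-success terms."""
    n_subagents = len(subtask_results)
    outcome_term = 1.0 if task_succeeded else 0.0
    count_term = min(n_subagents, max_subagents) / max_subagents
    success_term = sum(subtask_results) / n_subagents if n_subagents else 0.0
    return (
        weights.outcome * outcome_term
        + weights.subagent_count * count_term
        + weights.subtask_success * success_term
    )
```

Consistent with the tweet's remark that only the main agent's weights are updated, this scalar would be assigned to the main agent's trajectory rather than split among subagents.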

· @DJiafei · ❤35 👁2686 · Wed May 27 18:08

Thrilled to see TOPReward integrated into @LeRobotHF ! Since releasing this zero-shot approach, we've been blown away by community adoption — it's emerged as one of the top universal reward models, with applications reaching well beyond robotics. With HF's support, we can't wait to see what people build with TOPReward next!

Topic: agentic reward model

This topic covers 16 unique tweets; selected high-engagement picks follow.

🖼 @Jia__Guo · ❤228 👁32575 · Tue May 26 15:11

Curious about the secret sauce behind our trillion-scale agentic foundation model? Here it comes!🥳 Last year, we released IcePop to stabilize MoE RL with double-sided masking. As we dive deeper, something unexpected happened: the masking ratio went down, while the training–inference mismatch continued to grow!😞 This year, we introduce 𝑲𝑷𝒐𝒑🪩, which replaces the fixed ratio constraint with the binary KL divergence to adaptively mask inappropriate tokens! The masking ratio adapts to fluctuations of the training–inference gap during training, keeping policy optimization stable and effective with long-horizon agentic RL rollouts. With this simple change, it enables our Ring-2.6-1T to achieve over 76 on the SWE-bench-Verified with pure RL training! No modifications to infrastructure. No routing replay. Just one parameter, power your agentic RL with 𝑲𝑷𝒐𝒑! Click to learn more about the details! 📜Blog: https://t.co/uPu1gMg7ti
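
One way to read "binary KL divergence" in the tweet above is as the KL divergence between two Bernoulli distributions built from the probability a sampled token receives under the training engine and under the inference engine. The sketch below masks tokens whose divergence exceeds a threshold. The direction of the divergence, the threshold value, and the function names are assumptions for illustration; the linked blog, not this sketch, is the authority on how KPop is defined.

```python
import math


def binary_kl(p: float, q: float, eps: float = 1e-8) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)), with inputs clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_keep_mask(
    train_probs: list[float],
    infer_probs: list[float],
    threshold: float,
) -> list[bool]:
    """True where a token is kept, False where it is masked out of the loss."""
    if len(train_probs) != len(infer_probs):
        raise ValueError("probability lists must have the same length")
    return [
        binary_kl(p, q) <= threshold
        for p, q in zip(train_probs, infer_probs)
    ]
```

Unlike a fixed band on the probability ratio, a divergence threshold of this kind tolerates small absolute gaps on confident tokens while still flagging tokens where the two engines disagree sharply.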

🖼 @arcprize · ❤96 👁5492 · Wed May 27 17:17

A new ARC Prize 2026 - ARC-AGI-3 Kaggle notebook to help you get started with your first submission Get started in 3 make commands First $35K milestone prize (top score) will be awarded on June 30th, 2026 https://t.co/zz39tEEg2M

Topic: RL agent harness

This topic covers 18 unique tweets; selected high-engagement picks follow.

🖼 @nickbaumann_ · ❤741 👁149280 · Sat May 30 14:41

Great read -- all it really takes is: - a harness - connectors to your data/tools - reliable, always-accessible agent(s) The models have reached the inflection point where it's not more complicated than this https://t.co/lrj0I94FZb

· @julian__duru · ❤141 👁16456 · Mon Jun 01 09:02

We have been building Harnessy. Harnessy is an agent capability harness for software projects and agent runtimes: the layer that tells an agent what it can do, what context matters, and how to verify the environment.

Topic: agent harness eval

This topic covers 27 unique tweets; selected high-engagement picks follow.

· @novasarc01 · ❤299 👁54443 · Fri May 29 16:49

i’m increasingly convinced that the best agent evals will come from mining real agent failure traces. my view is that every failed trace contains a potential eval but not in its raw form. raw traces are messy, long and too specific. the research problem is to distill them into clean reproducible tests. the pipeline i’m interested in is (which i'm currently working on): failure trace → failure attribution → earliest divergence point → minimal reproducible state → targeted eval → regression suite this turns trace data from passive observability into an active improvement loop. like can we extract the exact decision point where the agent should have behaved differently? and can we convert that into an eval that catches the same failure class in the future? i guess this matters because most agent failures are trajectory-level failures and not just output-level failures. personally i think this is much more realistic than relying only on hand-written benchmarks (imo they should look more like failure memory systems). hand-written evals encode what we think agents will fail on. traces encode what agents actually failed on. also once you have the mechanism, you can mutate the trace into variants. that is basically fuzzing for agents.
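
The "earliest divergence point" step in the pipeline above can be sketched for the simplest case, where a passing reference trajectory exists for the same task. The `Step` type, the field names, and the assumption that a reference trace is available are all hypothetical; the tweet frames failure attribution in general as an open research problem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One agent step: the state it observed and the action it took."""

    state: str
    action: str


def earliest_divergence(failed: list[Step], reference: list[Step]) -> Optional[int]:
    """Index of the first step whose action differs, or None if none differs."""
    for index, (bad, good) in enumerate(zip(failed, reference)):
        if bad.action != good.action:
            return index
    return None


def to_eval_case(failed: list[Step], reference: list[Step]) -> Optional[dict]:
    """Turn the divergence point into a small, replayable test case."""
    index = earliest_divergence(failed, reference)
    if index is None:
        return None
    return {
        "state": failed[index].state,  # minimal state to reproduce
        "expected_action": reference[index].action,
        "observed_action": failed[index].action,
    }
```

Cases produced this way could be collected into a regression suite, and perturbing the stored state would give the trace variants the tweet compares to fuzzing.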

· @LangChain · ❤77 👁10676 · Thu May 28 15:49

Evals shape agent behavior. Every eval is a vector that shifts the behavior of your agentic system. More evals ≠ better agents. Instead, build targeted evals that reflect desired behaviors in production. Tools like LangSmith Engine help you targetedly create evals from your tracing data to build better agents.

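Limitations and risks

Sampling bias: the X search interface is gated behind a logged-in session, and results differ for logged-out users and across geographic locations.
Popularity bias: results are fetched in popularity order, so tweets from prominent companies (NVIDIA / OpenAI / StepFun / Qwen) take up a large share and independent researchers' voices are diluted.
Topic deduplication: the same tweet can be matched by several topics because the OR variants overlap. The cross-round ID table guarantees tweet-level deduplication, but the topic label is fixed by whichever topic surfaced the tweet first.
Engagement lag: good tweets with low early engagement can be pushed out of the top 10 by newer tweets within a 15-minute cycle, and are then cut off by the per-round fetch cap.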

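Terminology and conceptual boundaries

Cycle digest: a curated summary produced by periodically fetching and filtering an information stream. Its value lies in summarizing trends, not in relaying news item by item.
Top-ranking bias: platform popularity ranking amplifies high-engagement accounts and prominent companies, and does not represent research quality or long-term importance.
Deduplication: the same content can be matched by several keywords, so it has to be deduplicated by tweet, author, link, and topic to avoid double counting.
Evidence boundary: social media activity is good for finding leads, but conclusions should still be verified against papers, code, official documentation, or reproducible experiments.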

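Evidence boundaries and source index

Full tweet text: all-tweets.json
Boundary: periodic observation can miss low-weight tweets that are posted between two rotations and quickly buried by newer content. Topic rotation relies on keyword hits, so coverage of long-tail topics is lower than for high-frequency areas such as audio LM and RL.
Unconfirmed: this note is a snapshot of 102 rounds of observation over roughly one day; aggregating trends across weeks needs a longer sampling window before the results stabilize.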

Topic: multi-modal LLM agent

🖼 @StepFun_ai · ❤1501 👁323998 · Fri May 29 00:00

🖼 @NousResearch · ❤1494 👁1251250 · Sat May 30 13:59
