Part 1 · Selected Papers
1. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
•arXiv: 2605.10942v1· PDF
Abstract
•EN: World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow two paradigms: the “Imagine-then-Execute” approach, which uses video prediction to infer actions via inverse dynamics, and the “Joint Modeling” approach, which jointly models actions and video representations. Based on systematic experiments, we observe a fundamental trade-off between these paradigms: the former explicitly leverages world models for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution. Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. Specifically, the world model provides spatio-temporal physical priors that condition two complementary action experts: a predictive expert that leverages latent dynamics for iterative action generation, and a reactive expert that directly infers actions from predicted visual evolution. To enable adaptive coordination, a Process-Adaptive Gating Mechanism is proposed to automatically determine the timing and location of switching between them. This allows the world model to drive the reactive expert to expand the exploration space and the predictive expert to perform precise interactions across different stages of a task. For evaluation, we construct three training-unseen test environments across six real-world robotic tasks, covering variations in background, position, and object semantics. Notably, HarmoWAM achieves strong zero-shot generalization across these scenarios, significantly outperforming prior state-of-the-art VLA models and WAMs by margins of 33% and 29%, respectively.
•Summary: World Action Models (WAMs) show promise for robot control, but existing paradigms trade generalizable transit against precise manipulation. This paper proposes HarmoWAM, an end-to-end WAM that leverages a world model to unify predictive and reactive control. Spatio-temporal physical priors condition two complementary action experts: a predictive expert that generates actions iteratively from latent dynamics, and a reactive expert that infers actions directly from the predicted visual evolution. A Process-Adaptive Gating Mechanism automatically decides when and where to switch between them. Across six real-world robotic tasks, HarmoWAM achieves strong zero-shot generalization in unseen test environments, significantly outperforming state-of-the-art VLA models and WAMs. A toy sketch of the two-expert gating idea follows.
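The abstract does not specify the gate's internals, so the snippet below is only a minimal PyTorch sketch of blending a predictive and a reactive action expert with a learned soft gate; all module names, dimensions, and the sigmoid gate itself are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedActionHead(nn.Module):
    """Two action experts mixed by a learned gate over world-model features."""
    def __init__(self, feat_dim: int, action_dim: int):
        super().__init__()
        # Hypothetical stand-ins for the paper's predictive/reactive experts.
        self.predictive_expert = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
        self.reactive_expert = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, wm_feat: torch.Tensor) -> torch.Tensor:
        a_pred = self.predictive_expert(wm_feat)   # precise, in-distribution
        a_react = self.reactive_expert(wm_feat)    # broad, exploratory
        g = self.gate(wm_feat)                     # per-step mixing weight in [0, 1]
        return g * a_pred + (1.0 - g) * a_react

head = GatedActionHead(feat_dim=512, action_dim=7)
actions = head(torch.randn(4, 512))               # (batch, action_dim)
```

A hard switch, as the phrase "timing and location of switching" suggests, could replace the soft blend with `torch.where(g > 0.5, a_pred, a_react)`.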
2. Unified Noise Steering for Efficient Human-Guided VLA Adaptation
•arXiv: 2605.10821v1· PDF
Abstract
•EN: Diffusion-based vision-language-action (VLA) models have emerged as strong priors for robotic manipulation, yet adapting them to real-world distributions remains challenging. In particular, on-robot reinforcement learning (RL) is expensive and time-consuming, so effective adaptation depends on efficient policy improvement within a limited budget of real-world interactions. Noise-space RL lowers the cost by keeping the pretrained VLA fixed as a denoising generator while updating only a lightweight actor that predicts the noise. However, its performance is still limited due to inefficient autonomous exploration. Human corrective interventions can reduce this exploration burden, but they are naturally provided in action space, whereas noise-space finetuning requires supervision over noise variables. To address these challenges, we propose UniSteer, a Unified Noise Steering framework that combines human corrective guidance with noise-space RL through approximate action-to-noise inversion. Given a human corrective action, UniSteer inverts the frozen flow-matching decoder to recover a noise target, which provides supervised guidance for the same noise actor that is simultaneously optimized via reinforcement learning. Real-world experiments on diverse manipulation tasks show that UniSteer adapts more efficiently than strong noise-space RL and action-space human-in-the-loop baselines, improving the success rate from 20% to 90% in 66 minutes on average across four real-world adaptation tasks.
•Summary: Diffusion-based vision-language-action (VLA) models are strong priors for robotic manipulation, but adapting them to real-world distributions is challenging. Noise-space reinforcement learning (RL) lowers the cost, yet its autonomous exploration is inefficient; human corrective interventions are naturally given in action space, whereas noise-space finetuning requires supervision over noise variables. We therefore propose UniSteer, a unified noise steering framework that combines human corrective guidance with noise-space RL through approximate action-to-noise inversion: it inverts the frozen flow-matching decoder to recover a noise target that supervises the noise actor. Real-world experiments show UniSteer adapts more efficiently than strong noise-space RL and action-space human-in-the-loop baselines, raising the success rate from 20% to 90% within 66 minutes on average across four real-world adaptation tasks. A sketch of the inversion step follows.
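One way to read "approximate action-to-noise inversion" is as running the flow-matching ODE backward from the corrective action to the noise that would generate it. The sketch below assumes the common convention of noise at t=0 and data at t=1 and a `velocity_net(x, t, obs)` conditioning interface; the paper's actual inversion scheme may differ.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def invert_action_to_noise(velocity_net: nn.Module,
                           action: torch.Tensor,
                           obs_emb: torch.Tensor,
                           n_steps: int = 10) -> torch.Tensor:
    """Euler integration of dx/dt = v(x, t | obs) run backward from t=1 to t=0."""
    x = action.clone()
    dt = 1.0 / n_steps
    for i in reversed(range(n_steps)):
        t = torch.full((x.shape[0],), (i + 1) * dt)
        x = x - dt * velocity_net(x, t, obs_emb)   # one Euler step back in time
    return x   # noise the frozen decoder maps (approximately) onto `action`

class DummyVelocity(nn.Module):  # stand-in so the sketch runs end to end
    def forward(self, x, t, obs):
        return torch.zeros_like(x)

noise_target = invert_action_to_noise(DummyVelocity(),
                                      action=torch.randn(2, 7),
                                      obs_emb=torch.zeros(2, 128))
```

The recovered `noise_target` can then serve as a supervised regression target for the noise actor, alongside its RL objective.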
3. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
•arXiv: 2605.10485v1· PDF
Abstract
•EN: Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA’s visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmark and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state-of-the-art among implicit spatial grounding methods for VLA models.
•Summary: Precise spatial reasoning is essential for robotic manipulation, but the visual backbones of current VLA models are pretrained mostly on 2D images and lack accurate 3D spatial awareness. Existing implicit spatial grounding methods align features at the level of LLM visual tokens, limiting generalizability and geometric interpretability. We propose VEGA, which directly aligns the VLA visual encoder's output with spatially-aware features from DINOv2-FiT3D, grounding spatial awareness before any linguistic entanglement occurs. The alignment uses a lightweight projector trained with a cosine similarity loss alongside the action objective and is discarded at inference, adding no computational overhead. Experiments show VEGA consistently outperforms existing implicit spatial grounding baselines on simulation benchmarks and real-world tasks. A minimal sketch of the alignment loss follows.
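Since the abstract describes the mechanism concretely (a projector plus a cosine similarity loss, dropped at inference), a minimal PyTorch sketch is straightforward; the token counts and feature dimensions below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VegaAligner(nn.Module):
    """Lightweight projector aligning VLA encoder tokens to a 3D-aware teacher."""
    def __init__(self, vla_dim: int = 1024, teacher_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vla_dim, teacher_dim)  # discarded at inference

    def loss(self, vla_tokens: torch.Tensor, teacher_tokens: torch.Tensor):
        z = self.proj(vla_tokens)
        # 1 - cosine similarity, averaged over all patch tokens.
        return (1.0 - F.cosine_similarity(z, teacher_tokens, dim=-1)).mean()

aligner = VegaAligner()
align_loss = aligner.loss(torch.randn(2, 196, 1024),   # VLA encoder output
                          torch.randn(2, 196, 768))    # DINOv2-FiT3D features
# total_loss = action_loss + lam * align_loss   # lam: alignment weight (assumed)
```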
4. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
•arXiv: 2605.10426v1· PDF
Abstract
•EN: Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/potatochip1211/CoWorld-VLA.
•Summary: Vision-Language-Action (VLA) models are a promising paradigm for end-to-end autonomous driving, but existing reasoning mechanisms struggle to provide planning-oriented intermediate representations. This paper proposes CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving in which world representations serve as explicit conditions that guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens, providing planner-accessible conditioning signals. Four token types are constructed: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens. Action generation uses a diffusion-based hierarchical multi-expert fusion planner. Experiments show CoWorld-VLA achieves competitive results in future scene generation and planning on the NAVSIM v1 benchmark.
5. PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
•arXiv: 2605.10925v1· PDF
Abstract
•EN: Large-scale pretraining has made Vision-Language-Action (VLA) models promising foundations for generalist robot manipulation, yet adapting them to downstream tasks remains necessary. However, the common practice of full fine-tuning treats pretraining as initialization and can shift broad priors toward narrow training-distribution patterns. We propose PriorVLA, a novel framework that preserves pretrained priors and learns to leverage them for effective adaptation. PriorVLA keeps a frozen Prior Expert as a read-only prior source and trains an Adaptation Expert for downstream specialization. Expert Queries capture scene priors from the pretrained VLM and motor priors from the Prior Expert, integrating both into the Adaptation Expert to guide adaptation. Together, PriorVLA updates only 25% of the parameters updated by full fine-tuning. Across RoboTwin 2.0, LIBERO, and real-world tasks, PriorVLA achieves stronger overall performance than full fine-tuning and state-of-the-art VLA baselines, with the largest gains under out-of-distribution (OOD) and few-shot settings. PriorVLA improves over pi0.5 by 11 points on RoboTwin 2.0-Hard and achieves 99.1% average success on LIBERO. Across eight real-world tasks and two embodiments, PriorVLA reaches 81% in-distribution (ID) and 57% OOD success with standard data. With only 10 demonstrations per task, PriorVLA reaches 48% ID and 32% OOD success, surpassing pi0.5 by 24 and 22 points, respectively.
•Summary: Large-scale pretraining makes VLA models promising foundations for generalist robot manipulation, but adapting them to downstream tasks is still necessary. Full fine-tuning treats pretraining as mere initialization and can shift broad priors toward narrow training-distribution patterns. We propose PriorVLA, which preserves pretrained priors and learns to exploit them for effective adaptation: a frozen Prior Expert serves as a read-only prior source while an Adaptation Expert is trained for downstream specialization, with Expert Queries capturing scene priors and motor priors. PriorVLA updates only 25% of the parameters updated by full fine-tuning. Across RoboTwin 2.0, LIBERO, and real-world tasks, PriorVLA outperforms full fine-tuning and state-of-the-art baselines overall, with the largest gains in out-of-distribution and few-shot settings.
6. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
•arXiv: 2605.10921v1· PDF
Abstract
•EN: Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.
•Summary: Memory is a key component of robotic intelligence, yet existing robotic memory benchmarks lack multimodal annotations for memory formation, offer limited task coverage and structural complexity, and are confined to simulation. We present RoboMemArena, a large-scale benchmark of 26 tasks with average trajectory lengths exceeding 1,000 steps and 68.9% of subtasks being memory-dependent. The generation pipeline uses a VLM to design subtasks and provides memory-related annotations. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Experiments show PrediMem outperforms all baselines and yields insights into memory management. A toy sketch of the dual-buffer memory idea follows.
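How PrediMem scores keyframes is not detailed in the abstract; the toy class below only illustrates the recent-plus-keyframe buffer structure, with a caller-supplied "surprise" score standing in for the predictive-coding signal. Buffer sizes and the threshold are made-up values.

```python
from collections import deque

class DualMemoryBank:
    """Recent buffer + keyframe buffer, in the spirit of PrediMem."""
    def __init__(self, recent_size=16, keyframe_size=32, threshold=0.5):
        self.recent = deque(maxlen=recent_size)       # sliding short-term window
        self.keyframes = deque(maxlen=keyframe_size)  # sparse long-term anchors
        self.threshold = threshold

    def add(self, observation, surprise: float):
        self.recent.append(observation)
        if surprise > self.threshold:    # poorly predicted frames become keyframes
            self.keyframes.append(observation)

    def context(self):
        """What the high-level planner would condition on."""
        return list(self.keyframes) + list(self.recent)

bank = DualMemoryBank()
for step in range(1000):                 # trajectories exceed 1,000 steps
    bank.add(f"obs_{step}", surprise=0.9 if step % 100 == 0 else 0.1)
print(len(bank.context()))               # bounded context despite the long horizon
```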
7. MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems
•arXiv: 2605.10904v1· PDF
Abstract
•EN: Vehicle-to-Everything (V2X) communication has emerged as a promising paradigm for autonomous driving, enabling connected agents to share complementary perception information and negotiate with each other to benefit the final planning. Existing V2X benchmarks, however, fall short in two ways: (i) open-loop evaluations fail to capture the inherently closed-loop nature of driving, leading to evaluation gaps, and (ii) current closed-loop evaluations lack behavioral and interactive diversity to reflect real-world driving. Thus, the extent of the benefits of multi-agent systems for closed-loop driving is still unclear. In this paper, we introduce MDrive, a closed-loop cooperative driving benchmark comprising 225 scenarios grounded in both NHTSA pre-crash typologies and real-world V2X datasets. Our benchmark results demonstrate that multi-agent systems are generally better than single-agent counterparts. However, current multi-agent systems still face two important challenges: (i) perception sharing enhances perceptions, but doesn’t always translate to better planning; (ii) negotiation improves planning performance but harms it in complex and dense traffic scenarios. MDrive further provides an open-source toolbox for scenario generation, Real2Sim conversion, and human-in-the-loop simulation. Together, MDrive establishes a reproducible foundation for evaluating and improving the generalization and robustness of cooperative driving systems.
•Summary: Vehicle-to-Everything (V2X) communication is a promising paradigm for autonomous driving, but existing V2X benchmarks fall short: open-loop evaluations miss the inherently closed-loop nature of driving, and current closed-loop evaluations lack behavioral diversity. This paper introduces MDrive, a closed-loop cooperative driving benchmark comprising 225 scenarios. Results show that multi-agent systems generally outperform their single-agent counterparts, yet two challenges remain: perception sharing enhances perception but does not always translate into better planning, and negotiation improves planning but harms it in complex, dense traffic. MDrive also provides an open-source toolbox for scenario generation, Real2Sim conversion, and human-in-the-loop simulation, establishing a reproducible foundation for evaluating cooperative driving systems.
8. Is Your Driving World Model an All-Around Player?
•arXiv: 2605.10858v1· PDF
Abstract
•EN: Today’s driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.
•Summary: Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally: some render photorealistic textures but violate physics, while others are geometrically consistent but fail in closed-loop planning. We introduce WorldLens, a unified benchmark that measures world-model fidelity through five complementary aspects and 24 standardized dimensions. An evaluation of six representative models shows that no existing approach dominates across all axes. To bridge algorithmic metrics and human perception, we further contribute WorldLens-26K, a human-annotated preference dataset, and WorldLens-Agent, a vision-language evaluator. Together, the benchmark, dataset, and agent form a unified ecosystem that assesses generated worlds not merely by visual appeal but by physical and behavioral fidelity.
9. ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
•arXiv: 2605.10819v1· PDF
Abstract
•EN: Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM’s locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
•Summary: Vision-language-action (VLA) models are constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models can extract such priors from video, but reconstruction-trained latent codes are not necessarily suitable for policy generation. We introduce ALAM, an algebraically consistent latent action model that turns temporal relations in action-free video into structural supervision: latent transitions are grounded by reconstruction and regularized by composition and reversal consistency. For downstream VLA learning, the pretrained encoder is frozen and its latent transition sequences serve as auxiliary generative targets. Representation probes show ALAM reduces additivity and reversibility errors, and transferring to VLA policies raises average success rates on MetaWorld and LIBERO. A sketch of the two consistency penalties follows.
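The composition and reversal constraints have a direct algebraic reading: over a frame triplet (a, b, c), the transition a→c should equal the sum of a→b and b→c, and b→a should be the negation of a→b. The sketch below assumes a transition encoder `enc(x, y)` over frame features and omits the reconstruction grounding the paper pairs with these penalties.

```python
import torch
import torch.nn as nn

def algebraic_consistency_losses(enc: nn.Module, fa, fb, fc):
    """Composition and reversal penalties encouraging a locally additive space."""
    t_ab, t_bc, t_ac = enc(fa, fb), enc(fb, fc), enc(fa, fc)
    t_ba = enc(fb, fa)
    comp = ((t_ab + t_bc) - t_ac).pow(2).mean()   # T(a->c) = T(a->b) + T(b->c)
    rev = (t_ba + t_ab).pow(2).mean()             # T(b->a) = -T(a->b)
    return comp, rev

class ToyTransitionEncoder(nn.Module):            # illustrative architecture only
    def __init__(self, d=64, z=16):
        super().__init__()
        self.net = nn.Linear(2 * d, z)
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1))

enc = ToyTransitionEncoder()
fa, fb, fc = (torch.randn(8, 64) for _ in range(3))  # a frame triplet (as features)
comp_loss, rev_loss = algebraic_consistency_losses(enc, fa, fb, fc)
```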
10. MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction
•arXiv: 2605.10760v1· PDF
Abstract
•EN: Collaborative photorealistic 3D reconstruction from multiple agents enables rapid large-scale scene capture for virtual production and cooperative multi-robot exploration. While recent 3D Gaussian Splatting (3DGS) SLAM algorithms can generate high-fidelity real-time mapping, most of the existing multi-agent Gaussian SLAM methods still rely on RGB-D sensors to obtain metric depth and simplify cross-agent alignment, which limits the deployment on lightweight, low-cost, or power-constrained robotic platforms. To address this challenge, we propose MAGS-SLAM, the first RGB-only multi-agent 3DGS SLAM framework for collaborative scene reconstruction. Each agent independently builds local monocular Gaussian submaps and transmits compact submap summaries rather than raw observations or dense maps. To facilitate robust collaboration in the presence of monocular scale ambiguity, our framework integrates compact submap communication, geometry- and appearance-aware loop verification, and occupancy-aware Gaussian fusion, enabling coherent global reconstruction without active depth sensors. We further introduce the ReplicaMultiagent Plus benchmark for evaluating collaborative Gaussian SLAM. Intensive experiments on synthetic and real-world datasets show that MAGS-SLAM achieves competitive tracking accuracy and comparable or superior rendering quality to state-of-the-art RGB-D collaborative Gaussian SLAM methods while relying only on RGB images.
•Summary: Collaborative photorealistic 3D reconstruction from multiple agents enables rapid large-scale scene capture. Most existing multi-agent 3D Gaussian Splatting (3DGS) SLAM methods rely on RGB-D sensors for metric depth, limiting deployment on lightweight platforms. We propose MAGS-SLAM, the first RGB-only multi-agent 3DGS SLAM framework. Each agent independently builds local monocular Gaussian submaps and transmits compact submap summaries rather than raw observations. The framework integrates compact submap communication, geometry- and appearance-aware loop verification, and occupancy-aware Gaussian fusion, enabling coherent global reconstruction without active depth sensors. Experiments on synthetic and real-world datasets show that MAGS-SLAM achieves competitive tracking accuracy and rendering quality from RGB images alone.
11. C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
•arXiv: 2605.10744v1· PDF
Abstract
•EN: Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.
•Summary: Safety-critical planning in complex environments, especially at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods struggle to capture complex scene semantics and infer latent risks, and while VLMs offer promising approaches, they lack reflective and causal reasoning. We therefore propose a counterfactual chain-of-thought (C-CoT) framework that uses a VLM to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning, with a structured meta-action evaluation tree introduced in the counterfactual stage. We construct the DeepAccident-CCoT dataset from the DeepAccident benchmark and fine-tune a Qwen2.5-VL model. The model achieves 81.9% risk-prediction recall and lowers the collision rate to 3.52%. A scaffold of the five-stage decomposition follows.
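The exact prompts are not given in the abstract; the scaffold below only illustrates the five-stage decomposition and where the meta-action evaluation tree would plug in. `query_vlm` and the stage wordings are hypothetical.

```python
# Five sequential reasoning stages, each conditioned on the previous answers.
STAGES = [
    ("scene_description", "Describe the driving scene in the image."),
    ("critical_objects", "List the objects that could affect the ego vehicle."),
    ("risk_prediction", "Predict the risk posed by each critical object."),
    ("counterfactual", "For each meta-action (accelerate / keep / brake / turn),"
                       " reason about the consequences if it were taken."),
    ("action_planning", "Choose the safest action and justify it."),
]

def c_cot(query_vlm, image) -> dict:
    context, outputs = [], {}
    for name, instruction in STAGES:
        answer = query_vlm(image=image, history=context, prompt=instruction)
        context.append((instruction, answer))   # later stages see earlier ones
        outputs[name] = answer
    return outputs

# Stub client so the scaffold runs; a real client would call the fine-tuned VLM.
stub = lambda image, history, prompt: f"[answer to: {prompt[:40]}...]"
print(c_cot(stub, image=None)["action_planning"])
```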
12. Decentralized Contingency MPC based on Safe Sets for Nonlinear Multi-agent Collision Avoidance
•arXiv: 2605.10738v1· PDF
Abstract
•EN: Decentralized collision avoidance remains challenging, particularly when agents do not communicate any information related to planned trajectories. Most existing approaches either rely on conservative coordination mechanisms or provide limited guarantees on recursive feasibility and convergence. This paper develops a decentralized contingency MPC framework for multi-agent systems with nonlinear dynamics that achieves collision-free motion under a state-only information pattern. Each agent follows the same consensual rule set, enabling safe decentralized planning without communication. Each agent solves a local optimization problem that couples a nominal trajectory with a contingency certificate ensuring a feasible backup maneuver under receding-horizon operation. A novel geometric and decentralized safe-set update mechanism prevents feasibility loss between consecutive time steps. The resulting scheme guarantees recursive feasibility, including collision avoidance, and establishes a Lyapunov-type convergence result to an admissible safe equilibrium. Simulation results demonstrate performance in both sparse and dense multi-agent environments, including cluttered bottleneck scenarios and under plug-and-play operation.
•Summary: Decentralized collision avoidance is challenging, particularly when agents do not communicate any information about planned trajectories. Existing methods rely on conservative coordination mechanisms or provide limited recursive-feasibility and convergence guarantees. This paper develops a decentralized contingency MPC framework for multi-agent systems with nonlinear dynamics that achieves collision-free motion under a state-only information pattern. Each agent follows the same consensual rule set, enabling safe decentralized planning without communication, and solves a local optimization problem that couples a nominal trajectory with a contingency certificate. A novel geometric, decentralized safe-set update mechanism prevents feasibility loss between consecutive time steps, and the scheme guarantees recursive feasibility together with a Lyapunov-type convergence result.
13. XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies
•arXiv: 2605.10734v1· PDF
Abstract
•EN: For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards. While prior data is used to augment experience and pretrain models, we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting, due to a failure to use pretrained policies effectively. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pretrained policies, and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy like prior works. We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures due to its higher-entropy predictions. XQCfD achieves state-of-the-art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit, Robomimic, and MimicGen benchmarks – notably with a low update-to-data ratio and no ensemble networks.
•Summary: Online exploration is expensive in real-world reinforcement learning, so robotic RL commonly incorporates additional data to improve sample efficiency. Existing algorithm designs fail to use pretrained policies effectively and thus fall short of the sample efficiency possible in this setting. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pretrained policies, and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy. The stationary architecture enables better out-of-distribution policy improvement than standard architectures thanks to its higher-entropy predictions. XQCfD achieves state-of-the-art performance on complex sparse-reward manipulation tasks with a low update-to-data ratio and no ensemble networks. A sketch of demonstration-augmented replay follows.
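The abstract names augmented replay buffers as one ingredient; a common way to realize this, sketched below with an assumed 50/50 split, is to sample each training batch partly from demonstrations and partly from online experience. The mixing fraction and transition format are illustrative, not the paper's configuration.

```python
import random

class MixedReplayBuffer:
    """Replay buffer that mixes expert demonstrations with online transitions."""
    def __init__(self, demos, demo_fraction=0.5):
        self.demos = list(demos)          # fixed prior data
        self.online = []                  # grows during interaction
        self.demo_fraction = demo_fraction

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size):
        n_demo = min(int(batch_size * self.demo_fraction), len(self.demos))
        n_online = min(batch_size - n_demo, len(self.online))
        return (random.sample(self.demos, n_demo)
                + random.sample(self.online, n_online))

buf = MixedReplayBuffer(demos=[("s", "a", 1.0, "s'")] * 100)
for _ in range(200):
    buf.add(("s", "a", 0.0, "s'"))
batch = buf.sample(32)   # half demonstrations, half online experience
```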
14. VRA: Grounding Discrete-Time Joint Acceleration in Voltage-Constrained Actuation
•arXiv: 2605.10696v1· PDF
Abstract
•EN: Discrete-time joint acceleration constraints are widely used to enforce position and velocity limits. However, under voltage-constrained electric actuators, kinematically admissible accelerations may be physically unrealizable, exposing a missing execution-level abstraction. We propose Voltage-Realizable Acceleration (VRA), a joint-level acceleration interface that grounds kinematic acceleration in voltage-constrained actuator physics by restricting commanded accelerations to voltage-realizable constraints. Hardware experiments on electric actuators and a wheel-legged quadruped show that VRA removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations.
•Summary: Discrete-time joint acceleration constraints are widely used to enforce position and velocity limits, but under voltage-constrained electric actuators, kinematically admissible accelerations may be physically unrealizable, exposing a missing execution-level abstraction. We propose Voltage-Realizable Acceleration (VRA), a joint-level acceleration interface that grounds kinematic acceleration in voltage-constrained actuator physics by restricting commanded accelerations to voltage-realizable values. Hardware experiments on electric actuators and a wheel-legged quadruped show that VRA removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations. A back-of-the-envelope sketch of the voltage constraint follows.
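The physics the abstract alludes to is the textbook back-EMF argument: at higher joint speed, less voltage headroom remains to produce torque. The sketch below derives acceleration bounds from the steady-state DC motor model tau = k_t (V - k_e * omega) / R; it illustrates the underlying constraint rather than the paper's actual interface, and all motor constants are made up.

```python
def voltage_realizable_accel_bounds(omega, v_max, R, k_t, k_e, inertia):
    """Acceleration bounds realizable under a +/- v_max voltage limit."""
    tau_hi = k_t * (v_max - k_e * omega) / R    # max driving torque at this speed
    tau_lo = k_t * (-v_max - k_e * omega) / R   # max braking torque
    return tau_lo / inertia, tau_hi / inertia

def clamp_accel(a_cmd, omega, **motor):
    a_lo, a_hi = voltage_realizable_accel_bounds(omega, **motor)
    return min(max(a_cmd, a_lo), a_hi)

motor = dict(v_max=24.0, R=0.5, k_t=0.1, k_e=0.1, inertia=0.01)
print(clamp_accel(500.0, omega=0.0, **motor))    # near stall: 480.0 is realizable
print(clamp_accel(500.0, omega=200.0, **motor))  # at speed: only 80.0 remains
```

The same kinematically admissible command (500 rad/s^2 here) is realizable at standstill but not at speed, which is exactly the gap VRA closes.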
15. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
•arXiv: 2605.10588v1· PDF
Abstract
•EN: Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.
•Summary: Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks that require viewpoint-dependent understanding, largely because they are confined to a single static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the added evidence. Systematic experiments address three research questions on instruction format, generation fidelity, and inference-time visual scaling. Across four spatial subtask categories and four LMM architectures, TwNV consistently improves accuracy, with the largest gains on viewpoint-sensitive subtasks, establishing novel-view generation as a practical lever for advancing the spatial intelligence of LMMs.
16. CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations
•arXiv: 2605.10586v1· PDF
Abstract
•EN: Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene’s kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene’s dynamic evolution, solely from visual observations.
•Summary: Learning a physical model from video that understands physical laws and predicts future object trajectories is a formidable challenge in AI. Prior approaches use PDEs as soft constraints or integrate physics simulators, but rely on strong priors or high-quality geometry reconstruction. This paper proposes CausalGS, which learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, without explicit priors. Its core is an inverse physics inference module that decouples the dynamics problem into the joint inference of two factors: an initial velocity field representing the scene's kinematics, and the intrinsic material properties governing its dynamics. The inferred physical information then guides learning through a differentiable physics simulator in a physics-regularized manner. Experiments show CausalGS surpasses the state of the art on long-term future frame extrapolation.
17. VISTA: A Generative Egocentric Video Framework for Daily Assistance
•arXiv: 2605.10579v1· PDF
Abstract
•EN: Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.
•Summary: Training AI agents to proactively assist humans in daily activities requires large-scale visual data, but capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed for transfer. We therefore introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA uses a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes spanning two levels of agent autonomy: reactive and proactive, with proactive modes further divided into explicit and implicit types. VISTA lets users customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection.
18. VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos
•arXiv: 2605.10567v1· PDF
Abstract
•EN: In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.
•Summary: This paper aims to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works use physical losses merely as soft constraints or integrate physics simulation into neural networks, but often fail to learn complex motion physics effectively. We propose VeloGauss, which learns the physical properties of complex dynamic 3D scenes without physical priors: it learns a velocity field for each Gaussian particle via a Physics Code and a Particle Dynamics System, and incorporates Global Physical Constraints to ensure scene-level physical consistency. Extensive experiments on four public datasets show state-of-the-art performance on both novel view interpolation and future frame extrapolation.
19. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
•arXiv: 2605.10564v1· PDF
Abstract
•EN: End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird’s-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: https://github.com/hotdogcheesewhite/DeepSight.
•Summary: End-to-end autonomous driving systems increasingly integrate Vision-Language Model (VLM) architectures, but the reasoning mechanisms most methods adopt are direct transplants from general domains, with little exploration tailored to driving scenarios. This paper proposes a driving world model that predicts latent semantic features of consecutive future frames in parallel in bird's-eye-view (BEV) space, enabling long-horizon modeling of future world states. We also introduce an efficient, adaptive text reasoning mechanism that exploits additional social knowledge and reasoning capability to further improve driving performance in challenging long-tail scenarios. The approach achieves state-of-the-art results on the closed-loop Bench2drive benchmark.
20. Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning
•arXiv: 2605.10546v1· PDF
Abstract
•EN: Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths - at their respective best conditions, visual scaling unlocks a 28 % performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is a more spatially localized visual attention of the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. To facilitate future research on resolution scaling in deep RL, we publicly release the open-source code for the Procgen-HD benchmark: https://github.com/raphajaner/procgen-hd.
•Summary: Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks. This paper shows that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve performance and generalization, provided the architecture can process them effectively. The widely used Impala encoder, which flattens spatial features into a vector, suffers quadratic parameter growth as resolution increases; replacing the flattening with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and, at their respective best settings, unlocks a 28% performance gain for Impoola over Impala. These results challenge aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. The small calculation below makes the parameter-growth argument concrete.
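Why flattening couples parameters to resolution is plain arithmetic: the first fully connected layer after a Flatten sees channels x height x width inputs, while after global average pooling it sees only channels. The toy numbers below (a conv stack assumed to downsample by 8x, 32 output channels, a 256-unit head) are illustrative, not the actual Impala/Impoola configurations.

```python
def head_params(resolution: int, flatten: bool, channels=32, hidden=256):
    """Weight count of the first FC layer after the conv stack."""
    feat = resolution // 8                        # spatial size after convs
    in_dim = channels * feat * feat if flatten else channels
    return in_dim * hidden

for res in (64, 128, 256):
    print(f"{res:>3}px  flatten: {head_params(res, True):>10,}"
          f"   gap: {head_params(res, False):>6,}")
#  64px  flatten:    524,288   gap:  8,192
# 128px  flatten:  2,097,152   gap:  8,192
# 256px  flatten:  8,388,608   gap:  8,192
```

The flattened head grows quadratically with resolution; global average pooling stays constant, which is what lets the Impoola-style encoder exploit higher-resolution observations.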
21. SkillEvolver: Skill Learning as a Meta-Skill
•arXiv: 2605.10500v1· PDF
Abstract
•EN: Agent skills today are static artifacts: authored once – by human curation or one-shot generation from parametric knowledge – and then consumed unchanged, with no mechanism to improve from real use. We propose SkillEvolver, a lightweight, plug-and-play solution for online skill learning, in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. The learning target of SkillEvolver is the skill’s prose and code, not model weights, so that the resulting artifact drops into any agent without retraining; and the meta-skill itself is just another skill, loaded through the same interface by any protocol-compliant CLI-agent. Unlike trace-distillation, the meta-skill refines only after deploying the learnt skill, such that the learning signal comes from failures another agent encounters while using it – not from exploratory traces alone. Refinement iterations are governed by a fresh-agent overfit audit that catches possible leakage as well as deployed-skill-specific failures, including the silent-bypass mode in which a skill appears valid in content but is never invoked at runtime. On 83 SkillsBench tasks spanning domains, SkillEvolver reaches 56.8% accuracy versus 43.6% for curated human skills, and outperforms the no-skill baseline; on three GPU kernel optimization tasks from KernelBench, it also improves the mean speedup.
•Summary: Agent skills today are static artifacts: authored once, by human curation or one-shot generation from parametric knowledge, then consumed unchanged, with no mechanism to improve from real use. We propose SkillEvolver, a lightweight, plug-and-play solution for online skill learning in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. Its learning target is the skill's prose and code rather than model weights, so the resulting artifact drops into any agent without retraining. Unlike trace distillation, the meta-skill refines only after deploying the learned skill, so the learning signal comes from failures another agent encounters while using it. On 83 SkillsBench tasks, SkillEvolver reaches 56.8% accuracy, beating the 43.6% of curated human skills.
22. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
•arXiv: 2605.10434v1· PDF
Abstract
•EN: Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into “world simulators.” Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.
•Summary: Commercial video generation systems have improved rapidly, strengthening the view that video generators may be evolving into "world simulators," yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? The benchmark contains 436 curated test cases. Generated videos are evaluated with a human-aligned two-part methodology combining process-aware reasoning verification and multi-dimensional quality assessment. Results across modern video generators expose a persistent gap between visual plausibility and world reasoning.
23. SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation
•arXiv: 2605.10376v1· PDF
Abstract
•EN: Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increases. In general, current VLMs can only somewhat produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
•Summary: Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, but it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty. Using a standardized pointwise judge-based protocol, we evaluate three frontier VLMs; results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions.
24. Personal Visual Context Learning in Large Multimodal Models
•arXiv: 2605.10936v1· PDF
Abstract
•EN: As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user’s visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
•Summary: As wearables such as smart glasses bring Large Multimodal Models (LMMs) into users' continuous first-person visual streams, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this as Personal Visual Context Learning (Personal VCL), the prompt-time ability to use user-specific visual context to resolve personalized queries. To evaluate it systematically, we present Personal-VCL-Bench, a comprehensive benchmark spanning persons, objects, and behaviors. Our analysis of frontier LMMs reveals a profound context utilization gap. Motivated by this, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. A toy sketch of query-adaptive selection follows.
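Query-adaptive evidence selection can be read, at its simplest, as retrieving the stored contexts most similar to the query embedding; the bank below uses cosine similarity over random stand-in embeddings and omits the self-refinement step, so it is only a structural illustration, not the paper's method.

```python
import numpy as np

class ContextBank:
    """Stores (embedding, payload) pairs; selects top-k evidence per query."""
    def __init__(self):
        self.items = []

    def add(self, emb: np.ndarray, payload):
        self.items.append((emb / np.linalg.norm(emb), payload))

    def select(self, query_emb: np.ndarray, k: int = 3):
        q = query_emb / np.linalg.norm(query_emb)
        ranked = sorted(self.items, key=lambda it: -float(it[0] @ q))
        return [payload for _, payload in ranked[:k]]

rng = np.random.default_rng(0)
bank = ContextBank()
for i in range(20):                           # a stream of first-person observations
    bank.add(rng.normal(size=128), f"frame_{i}")
evidence = bank.select(rng.normal(size=128))  # contexts handed to the LMM
```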
25. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
•arXiv: 2605.10903v1· PDF
Abstract
•EN: This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver the goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameters’ difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, (2) can generalize to novel environments and embodiments out of the box.
•Summary: This paper addresses the problem that pretrained VLA models often fail to improve performance and cut adaptation cost effectively under standard supervised fine-tuning (SFT). Fine-tuning with auxiliary objectives improves performance and reduces convergence steps, but the extra losses incur significant computational overhead. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, the two goals of auxiliary-objective SFT are decoupled in parameter space: enhancing general capabilities and fitting task-specific action distributions. The model is trained to convergence on a small task set under two distinct training strategies, yielding two fine-tuned models whose parameter difference can be interpreted as the capability vector contributed by the auxiliary objectives. Merging this vector into the pretrained parameters yields a capability-enhanced meta model. The parameter arithmetic is sketched below.
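The capability-vector construction is task-arithmetic-style parameter algebra, which is easy to state in code. The sketch assumes all three models share one architecture; the scaling factor `alpha` and the toy models are illustrative.

```python
import torch
import torch.nn as nn

def capability_vector(model_aux: nn.Module, model_sft: nn.Module) -> dict:
    """Parameter difference: auxiliary-objective finetune minus plain SFT."""
    sft = dict(model_sft.named_parameters())
    return {name: p.detach() - sft[name].detach()
            for name, p in model_aux.named_parameters()}

@torch.no_grad()
def merge_into(pretrained: nn.Module, cap_vec: dict, alpha: float = 1.0):
    """Add the (scaled) capability vector onto the pretrained weights."""
    for name, p in pretrained.named_parameters():
        p.add_(alpha * cap_vec[name])

# Toy demo; in practice these would be the same VLA finetuned two ways.
base, aux, sft = (nn.Linear(4, 4) for _ in range(3))
merge_into(base, capability_vector(aux, sft), alpha=0.5)
```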
26. Safe Aerial 3D Path Planning for Autonomous UAVs using Magnetic Potential Fields
•arXiv: 2605.10880v1· PDF
Abstract
•EN: Safe autonomous Uncrewed Aerial Vehicle (UAV) navigation in urban environments requires real-time path planning that avoids obstacles. MaxConvNet is a potential-field planner that leverages properties of Maxwell’s equations to generate a path to the goal without local minima. We extend the 2D MaxConvNet magnetic field planner to 3D, using a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived 101^3 voxel grids. Evaluation across 100 randomized closed-loop trials in two distinct Cosys-AirSim urban environments, a dense night-time cityscape and a suburban district, shows a 100% path planning success rate on both maps without retraining. In offline path planning, 3DMaxConvNet produces path lengths comparable to A* on unseen maps while reducing runtime from 0.155–0.17s to 0.087–0.089s, or about 1.7–1.95 times faster than A*. Against RRT*(3k), 3DMaxConvNet achieves similar path quality while reducing planning runtime from 17.2–17.5s to about 0.09s, which is roughly 193–201 times faster than RRT*(3k).
•Summary: Safe autonomous UAV navigation in urban environments requires real-time path planning that avoids obstacles. MaxConvNet is a potential-field planner that exploits properties of Maxwell's equations to generate a path to the goal without local minima. We extend the 2D MaxConvNet magnetic-field planner to 3D, using a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived 101^3 voxel grids. Across 100 randomized closed-loop trials in two distinct Cosys-AirSim urban environments, the planner achieves a 100% path planning success rate on both maps without retraining. Offline, 3DMaxConvNet produces path lengths comparable to A* while running about 1.7–1.95 times faster, and it matches RRT*(3k) path quality while planning roughly 193–201 times faster. A toy sketch of path extraction from a potential field follows.
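Once a potential field over the 101^3 voxel grid exists (here a toy quadratic bowl rather than a network-predicted magnetic potential), extracting a path is greedy descent toward the minimum, which is where the planner's no-local-minima property matters. Everything below is a self-contained illustration of that step only.

```python
import numpy as np

def follow_potential(field: np.ndarray, start, goal, max_iters=500):
    """Greedy descent over a 3D scalar field, stepping to the best neighbor."""
    pos = np.array(start)
    path = [tuple(pos)]
    for _ in range(max_iters):
        if tuple(pos) == tuple(goal):
            break
        best, best_val = pos, field[tuple(pos)]
        for d in np.ndindex(3, 3, 3):             # 26-neighborhood (+ self)
            nxt = pos + (np.array(d) - 1)
            if (nxt >= 0).all() and (nxt < np.array(field.shape)).all():
                if field[tuple(nxt)] < best_val:
                    best, best_val = nxt, field[tuple(nxt)]
        if (best == pos).all():                   # a local minimum would stall here
            break
        pos = best
        path.append(tuple(pos))
    return path

goal = (90, 90, 90)                               # toy field: distance^2 to goal
idx = np.indices((101, 101, 101))
field = sum((idx[i] - goal[i]) ** 2 for i in range(3))
print(len(follow_potential(field, start=(5, 5, 5), goal=goal)))  # 86 waypoints
```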
27. MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
•arXiv: 2605.10833v1· PDF
Abstract
•EN: Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model’s average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.
•Summary: Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets focus mainly on static images or sparse views, which do not fully reflect continuous inspection in real industrial settings. We introduce MMVIAD, to our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a multi-task evaluation benchmark. MMVIAD contains object-centric 2-second inspection clips covering 48 object categories, 14 environments, and 6 structural anomaly types, and supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. To improve transferable anomaly understanding, we develop a two-stage post-training pipeline (PS-SFT followed by VISTA-GRPO) that produces the final model VISTA. On MMVIAD-Unseen, VISTA raises the base model's average score across the four tasks from 45.0 to 57.5.
28. The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning
•arXiv: 2605.10828v1· PDF
Abstract
•EN: As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet misleading documents degrade performance, but the quantitative relationship between the proportion of distractors and performance remains unstudied. In this work, we systematically vary the hard-distractor proportion in fixed-length contexts, revealing a striking nonlinear pattern: as the proportion of hard distractors increases, performance drops sharply within the first small fraction, while the remainder of the range yields only marginal additional decline. We term this “The First Drop of Ink” effect, analogous to how a single drop of ink contaminates water. Our theoretical and empirical analyses grounded in attention mechanics show that hard distractors capture disproportionate attention even at small proportions, with diminishing marginal impact as their proportion grows. Controlled experiments further show that filtering gains mainly come from context-length reduction rather than distractor removal; substantial recovery requires reducing the hard-distractor proportion to near zero, highlighting the importance of upstream retrieval precision.
•Summary: As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet misleading documents degrade performance, but the quantitative relationship between distractor proportion and performance had not been studied. This paper systematically varies the hard-distractor proportion in fixed-length contexts, revealing a striking nonlinear pattern: performance drops sharply within the first small fraction of hard distractors, while the remainder of the range adds only marginal further decline. We call this the "First Drop of Ink" effect. Theoretical and empirical analyses grounded in attention mechanics show that hard distractors capture disproportionate attention even at small proportions. Controlled experiments further show that filtering gains come mainly from context-length reduction rather than distractor removal.
29. MaD Physics: Evaluating information seeking under constraints in physical environments
•arXiv: 2605.10820v1· PDF
Abstract
•EN: Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.
•Summary: Scientific discovery is fundamentally a resource-constrained process that must navigate trade-offs between the quality and quantity of measurements under physical and cost constraints. Existing benchmarks for scientific-discovery agents focus on static knowledge-based reasoning or unconstrained experimental design, and do not capture the ability to measure and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark that evaluates agents' ability to make informative measurements and draw conclusions subject to constraints on measurement quality and quantity. It consists of three environments, each based on a distinct physical law; to mitigate contamination from existing knowledge, the laws are altered. In each trial, the agent measures the system until it exhausts an allotted budget, then must infer the underlying law to predict the system's future state.
30. PhyGround: Benchmarking Physical Reasoning in Generative World Models
•arXiv: 2605.10806v1· PDF
Abstract
•EN: Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman’s rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
•Summary: Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules governing real-world dynamics, yet evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks face three key problems: coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine annotation validity, and automated evaluators that are insufficiently physics-aware or hard to audit. We introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation, with 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws spanning solid-body mechanics, fluid dynamics, and optics. Eight modern video generation models are evaluated through a large-scale, quality-controlled human study. To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge.
Part 2 · Recent Hot Topics (5–7)
Recent hot-topic themes
1. Adaptation and Spatial Awareness for VLA Models
•Vision-language-action (VLA) model optimization: research focuses on efficiently adapting pretrained VLA models; PriorVLA proposes a prior-preserving adaptation framework, while Unified Noise Steering uses noise-space RL to cut the cost of real-robot interaction.
•Spatial awareness enhancement: VEGA and CapVector address the visual backbone's lack of 3D geometric supervision, improving spatial reasoning via explicit spatial alignment and transferable capability vectors.
2. Physically Consistent World Models and Simulation Evaluation
•Physical dynamics modeling: HarmoWAM and ALAM model physical dynamics through adaptive world action models and algebraically consistent latent transitions, rather than relying on video prediction alone.
•Simulation realism evaluation: Is Your Driving World Model an All-Around Player? and WorldReasonBench stress evaluating the physical and behavioral realism of world models in closed-loop planning, not just visual realism.
3. Gaussian-Splatting-Based Reconstruction of Dynamic 3D Scenes
•Multi-agent collaborative reconstruction: MAGS-SLAM uses monocular multi-agent Gaussian splatting for rapid, geometrically and photometrically consistent scene capture.
•Physical causality learning: CausalGS and VeloGauss learn physical causality and velocity fields from Gaussian representations, modeling dynamic scenes without physical priors.
4. Reasoning and Planning in Safety-Critical Scenarios
•Counterfactual reasoning: C-CoT introduces a counterfactual chain of thought, using VLMs for safe planning at complex urban intersections.
•Multi-agent collision avoidance: Decentralized Contingency MPC and Safe Aerial 3D Path Planning provide decentralized safe-set and path-planning guarantees for nonlinear multi-agent systems.
5. Long-Horizon Tasks and Robotic Memory Benchmarks
•Memory benchmarking: RoboMemArena fills the gap in multimodal memory-formation annotation and evaluates memory in long-horizon tasks.
•Navigation and planning benchmarks: SleepWalk and MDrive set more challenging evaluation standards for instruction-guided vision-language navigation and closed-loop cooperative driving, respectively.
6. Efficient Reinforcement Learning and Skill Evolution
•Sample efficiency: XQCfD accelerates actor-critic algorithms with prior data and prior policies, and the Higher Resolution study shows that high-resolution inputs improve the generalization of deep RL.
•Online skill evolution: SkillEvolver treats skill learning as a meta-skill, supporting online iterative authoring and refinement of domain-specific skills.
7. Personalized Visual Context and Industrial Applications
•Personalized visual learning: Personal Visual Context Learning targets visual personalization for individual users of wearable devices.
•Industrial anomaly detection and assistance: MMVIAD targets industrial multi-view video anomaly detection, while VISTA offers a generative egocentric video framework for daily assistance, pushing AI into vertical domains.
Part 3 · Macro Research Trends
Summary of macro research trends
This issue's papers reflect a deep shift in embodied intelligence and autonomous driving from perception-driven to world-model-driven research. First, Vision-Language-Action (VLA) models have become the core paradigm for generalist robot manipulation, with the research focus moving from pretraining alone toward efficient adaptation and spatial-awareness enhancement, addressing the loss of pretrained priors and weak 3D geometric understanding. Second, evaluation standards for world models are changing qualitatively, from the visual realism of generated videos toward their physical consistency and behavioral fidelity in closed-loop control, and the combination of physical causal reasoning with 3D representations such as Gaussian splatting is emerging as a new hotspot.
In addition, benchmarking has grown markedly more rigorous: dedicated benchmarks for robotic memory, long-horizon planning, physical reasoning, and multi-agent cooperation keep appearing, emphasizing real-world evaluation under partial observability. Finally, efficiency and safety remain the keys to deployment: noise-space reinforcement learning, skill evolution, and counterfactual reasoning are widely explored to achieve safe, efficient policy optimization within limited real-robot interaction budgets. Looking ahead, generalist agents with physical common sense, long-term memory, and personalized adaptation will be the main direction.
Deeper commentary and attachments are available on my 知识星球 (Knowledge Planet).