What Is the Value of Waymo's World Model for the Development of Autonomous Driving?

2026-02-28 01:19:11

Author: 孙海洋 (Sun Haiyang)

Source: https://zhuanlan.zhihu.com/p/2008242779336221630

Published with permission. To repost, please contact the original author.

On 2026-02-06, Waymo released its world model built on Genie 3 and showcased several core capabilities.

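A minimal summary

The Waymo World Model is defined as the core component responsible for generating hyper-realistic simulated environments. Technically, it is a generative model obtained by post-training Genie 3, and its target application is large-scale, hyper-realistic autonomous driving simulation. Through a rich set of video demos, the report showcases the model's core capabilities:

Multimodal data: images plus point clouds
Generation of extreme-weather, safety-critical, and ultra-long-tail scenarios
Controllable, long-horizon generation (20 to 30 s)
[Small highlight] Converting dashcam footage into driving data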

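What it signals

The inflection point in the technical paradigm may have arrived

World-model research has long debated explicit versus implicit 3D representations. The representative techniques are 3DGS and video generation, respectively.

Methods based on explicit 3D representations make it easier to implement scene interaction, physical feedback, and other properties that matter greatly for a world model. Methods based on implicit 3D representations are data-driven and must learn physical causality from massive amounts of data. Implicit methods, however, concentrate on final image quality, so their images are usually more realistic.

For world models based on video generation, generation efficiency is a tough nut to crack, especially in autonomous driving with its multiple sensors and multiple cameras. The technical report states: "through a more efficient variant of the Waymo World Model, we can simulate longer scenes with dramatic reduction in compute while maintaining high realism". Efficiency will be one of the key factors deciding whether the model can actually be used in production.

After Tesla showed how its generative-model-centric world model is used in production, Waymo has now shown, in somewhat more detail, what such a world model can do. In closed-loop simulation testing and data synthesis, the two key applications of world models in autonomous driving, the technical paradigm appears to be shifting from 3DGS-reconstruction-centric approaches to video-generation-centric ones. That inflection point may not be far off (this refers mainly to industry).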

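The overwhelming advantage of general-purpose world models

"Genie 3's strong world knowledge, gained from its pre-training on an extremely large and diverse set of videos, allows us to explore situations that were never directly observed by our fleet."

The capability of the base model (that is, the general-purpose world model) sets the capability ceiling of the autonomous driving world model. Massive amounts of high-quality driving data alone cannot train an excellent world model. Post-training a general-purpose world model on a certain amount of driving data, however, yields a very good autonomous driving world model.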

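A few very cool scene demos

"Long-tail" objects: the realistic footage in the report of lions and elephants walking on the road is the kind of scene we racked our brains over two years ago and still could not produce.

Scene generation from dashcam footage: dashcam videos on the internet contain a great deal of long-tail data that would be very costly to collect. If they can be converted into driving data, they are highly valuable for both evaluation and training.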

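A few open questions

The technical report still leaves several issues unclear (answered vaguely, or perhaps unsolved), for example:

Short generated clips: most videos are about 3 s long, which is clearly short for both closed-loop evaluation and data synthesis. The end of the report demonstrates 20 to 30 s sequences, but it is unclear how well the rarer cases shown earlier hold up when extended beyond 20 s. The report also briefly mentions a variant that greatly reduces compute, yet never says how efficient it actually is, which is critical for whether the model can really be deployed.
Realism of lidar generation: following common industry practice, a likely approach for the Waymo World Model is to first generate multi-view images and then train an image2lidar model to produce point clouds. The realism of the point clouds is not discussed, however. In the visualized demos, the point clouds keep fairly high quality no matter how extreme or rare the scene is. At the same time, as shown in the figure below, the point clouds look abnormal for foreground objects at very close range.

Information loss: if it follows the Genie 3 paradigm, the Waymo World Model most likely takes the first N frames plus several conditions (such as language and layout) as input and generates the subsequent sequence. Under this paradigm, extracting the conditions inevitably loses information, while evaluation places great weight on scene stability. Could content that differs across repeated generations, and differs from the logged images, compromise the fairness and accuracy of evaluation?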
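The concern about repeated generations can be made concrete with a toy sketch. The code below is not Waymo's or Genie 3's implementation. It assumes only the paradigm guessed at above (first N frames plus conditions in, later frames out), and every name in it (`toy_next_frame`, `rollout`, `max_divergence`, the `speed` and `noise_std` conditions) is hypothetical. Each "frame" is reduced to a single number so the script runs without any dependencies.

```python
import random
from typing import Callable, Dict, List, Sequence

Frame = List[float]  # stand-in for an image or point cloud


def toy_next_frame(context: Sequence[Frame],
                   conditions: Dict[str, float],
                   rng: random.Random) -> Frame:
    """Hypothetical stand-in for a generative model's sampling step.

    The next frame is the context mean, shifted by a condition-driven
    drift, plus sampling noise (the source of run-to-run differences).
    """
    base = sum(f[0] for f in context) / len(context)
    drift = conditions.get("speed", 0.0)
    noise = rng.gauss(0.0, conditions.get("noise_std", 0.1))
    return [base + drift + noise]


def rollout(context_frames: Sequence[Frame],
            conditions: Dict[str, float],
            horizon: int,
            window: int,
            seed: int,
            step_fn: Callable = toy_next_frame) -> List[Frame]:
    """Autoregressive rollout: each new frame is conditioned on the last
    `window` frames (logged or generated) and on the fixed conditions."""
    rng = random.Random(seed)
    frames = list(context_frames)
    for _ in range(horizon):
        frames.append(step_fn(frames[-window:], conditions, rng))
    return frames[len(context_frames):]  # return only generated frames


def max_divergence(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Largest per-frame gap between two rollouts of the same scenario."""
    return max(abs(x[0] - y[0]) for x, y in zip(a, b))


if __name__ == "__main__":
    logged = [[0.0], [1.0], [2.0]]            # the "first N frames"
    cond = {"speed": 1.0, "noise_std": 0.1}   # compressed scene description
    run_a = rollout(logged, cond, horizon=20, window=3, seed=0)
    run_b = rollout(logged, cond, horizon=20, window=3, seed=1)
    print("divergence between two samples:", max_divergence(run_a, run_b))
```

With identical logged frames and identical conditions, two samples still drift apart unless the random seed is pinned. An evaluation pipeline built on such a model would therefore probably need fixed seeds or scoring averaged over several samples.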

Original text of the technical report

(along with those stunning demo videos)

[For the original technical report, see waymo-worldmodel]

The Waymo Driver has traveled nearly 200 million fully autonomous miles, becoming a vital part of the urban fabric in major U.S. cities and improving road safety. What riders and local communities don’t see is our Driver navigating billions of miles in virtual worlds, mastering complex scenarios long before it encounters them on public roads. Today, we are excited to introduce the Waymo World Model, a frontier generative model that sets a new bar for large-scale, hyper-realistic autonomous driving simulation.

Simulation of the Waymo Driver evading a vehicle going in the wrong direction. The simulation initially follows a real event, and seamlessly transitions to using camera and lidar images automatically generated by an efficient real-time Waymo World Model.

Simulation is a critical component of Waymo’s AI ecosystem and one of the three key pillars of our approach to demonstrably safe AI. The Waymo World Model, which we detail below, is the component that is responsible for generating hyper-realistic simulated environments.

The Waymo World Model is built upon Genie 3—Google DeepMind's most advanced general-purpose world model that generates photorealistic and interactive 3D environments—and is adapted for the rigors of the driving domain. By leveraging Genie’s immense world knowledge, it can simulate exceedingly rare events—from a tornado to a casual encounter with an elephant—that are almost impossible to capture at scale in reality. The model’s architecture offers high controllability, allowing our engineers to modify simulations with simple language prompts, driving inputs, and scene layouts. Notably, the Waymo World Model generates high-fidelity, multi-sensor outputs that include both camera and lidar data.

This combination of broad world knowledge, fine-grained controllability, and multi-modal realism enhances Waymo’s ability to safely scale our service across more places and new driving environments. In the following sections we showcase the Waymo World Model in action, featuring simulations of the Waymo Driver navigating diverse rare edge-case scenarios.

Emergent Multimodal World Knowledge

Most simulation models in the autonomous driving industry are trained from scratch based on only the on-road data they collect. That approach means the system only learns from limited experience. Genie 3’s strong world knowledge, gained from its pre-training on an extremely large and diverse set of videos, allows us to explore situations that were never directly observed by our fleet.

Through our specialized post-training, we are transferring that vast world knowledge from 2D video into 3D lidar outputs unique to Waymo’s hardware suite. While cameras excel at depicting visual details, lidar sensors provide valuable complementary signals like precise depth. The Waymo World Model can generate virtually any scene—from regular, day-to-day driving to rare, long-tail scenarios—across multiple sensor modalities.

Extreme weather conditions and natural disasters

Driving on the Golden Gate Bridge, covered in light snow. Waymo’s shadow is visible in the front camera footage.

Driving out of a raging fire.

A suburban cul de sac completely submerged in stagnant flood water with floating furniture.

Encountering a tornado.

Rare and safety-critical events

Reckless driver driving off road.

A malfunctioned truck facing the wrong way, blocking the road.

Driving behind a vehicle with precariously positioned furniture on top.

The leading vehicle driving into the tree branches

Long-tail (pun intended!) objects and more

Encounter with a friendly elephant.

Encounter with a lion.

Encounter with a Texas longhorn.

A pedestrian dressing up as a T-rex.

Encountering a huge tumbleweed the size of a car.

Strong Simulation Controllability

The Waymo World Model offers strong simulation controllability through three main mechanisms: driving action control, scene layout control, and language control.
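As a reading aid, the three control mechanisms can be pictured as three optional inputs bundled into one request. The report does not describe any API, so the sketch below is purely illustrative: every class and field name (`DrivingAction`, `Agent`, `SceneLayout`, `SimulationControls`) and every unit is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DrivingAction:
    """One ego control step (hypothetical fields and units)."""
    steering_rad: float
    acceleration_mps2: float


@dataclass
class Agent:
    """Another road user placed in the scene (hypothetical)."""
    kind: str                         # e.g. "vehicle", "pedestrian"
    position_xy: Tuple[float, float]  # metres, in an assumed ego frame
    heading_rad: float = 0.0


@dataclass
class SceneLayout:
    """Road users plus signal state; road geometry omitted for brevity."""
    agents: List[Agent] = field(default_factory=list)
    traffic_light_state: Optional[str] = None  # e.g. "red", "green"


@dataclass
class SimulationControls:
    """Bundle of the three control channels named in the report."""
    actions: List[DrivingAction] = field(default_factory=list)
    layout: Optional[SceneLayout] = None
    prompt: Optional[str] = None

    def active_mechanisms(self) -> List[str]:
        """List which of the three channels this request actually uses."""
        names = []
        if self.actions:
            names.append("driving_action")
        if self.layout is not None:
            names.append("scene_layout")
        if self.prompt:
            names.append("language")
        return names


if __name__ == "__main__":
    controls = SimulationControls(
        layout=SceneLayout(
            agents=[Agent("vehicle", (12.0, -3.5))],
            traffic_light_state="red",
        ),
        prompt="light snow at dusk",
    )
    print(controls.active_mechanisms())
```

Keeping the three channels optional mirrors how the report presents them: a scenario may replay logged driving with only a language edit, or combine a new route with a mutated layout.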

Driving action control allows us to have a responsive simulator that adheres to specific driving inputs. This enables us to simulate “what if” counterfactual events such as whether the Waymo Driver could have safely driven more confidently instead of yielding in a particular situation.

Below is a comparison between the 3DGS-based approach and the Waymo World Model.

Counterfactual driving. We demonstrate simulations both under the original route in a past recorded drive, or a completely new route. While purely reconstructive simulation methods (e.g., 3D Gaussian Splats, or 3DGS) suffer from visual breakdowns due to missing observations when the simulated route is too different from the original driving, the fully learned Waymo World Model maintains good realism and consistency thanks to its strong generative capabilities.

Below are the layout control results.

Scene layout control allows for customization of the road layouts, traffic signal states, and the behavior of other road users. This way, we can create custom scenarios via selective placement of other road users, or applying custom mutations to road layouts.

Below are the language control results.

Language control is our most flexible tool that allows us to adjust time-of-day, weather conditions, or even generate an entirely synthetic scene (such as the long-tail scenarios shown previously).

Converting Dashcam Videos

During a scenic drive, it is common to record videos of the journey on mobile devices or dashcams, perhaps capturing piled up snow banks or a highway at sunset. The Waymo World Model can convert those kinds of videos, or any taken with a regular camera, into a multimodal simulation—showing how the Waymo Driver would see that exact scene. This process enables the highest degree of realism and factuality, since simulations are derived from actual footage.

Scalable Inference

Some scenes we want to simulate may take longer to play out, for example, negotiating passage in a narrow lane. That’s harder to do because the longer the simulation, the tougher it is to compute and maintain stable quality. However, through a more efficient variant of the Waymo World Model, we can simulate longer scenes with dramatic reduction in compute while maintaining high realism and fidelity to enable large-scale simulations.

Long rollout (4x speed playback) on an efficient variant of the Waymo World Model.

Report summary

By simulating the “impossible”, we proactively prepare the Waymo Driver for some of the most rare and complex scenarios. This creates a more rigorous safety benchmark, ensuring the Waymo Driver can navigate long-tail challenges long before it encounters them in the real world.
