
Tesla's big bet after mass layoffs: what is FSD v12 worth?

LatePost · May 8, 15:50

Source: LatePost (晚點LatePost)

Layoffs of more than 10,000 people, a drastic downsizing of the 4680 battery team, the departure of the senior vice president in charge of the powertrain (battery, motor, and electronic controls) along with other executives... the sweeping adjustments Tesla CEO Elon Musk set in motion on April 15 were only a prelude.

Half a month later, Tesla kept cutting key projects: the 4680 battery program saw further layoffs, the North American Supercharger team was disbanded entirely, the 9,000-ton integrated die-casting (giga-casting) project was halted, and a string of related executives left. Next, in June, Tesla will lay off more than 6,000 employees in California and Texas.

Musk's new bet is fully autonomous driving. The Robotaxi project has been given the highest priority: Musk announced it will be unveiled on August 8, and that Tesla will spend $10 billion this year on GPUs and in-house automotive chips to improve its autonomous driving system. He has said repeatedly that as long as the system keeps iterating, driverless operation will be achieved, turning Tesla into a $10 trillion company.

In China, Tesla's second-largest market, Musk also hopes this system can turn the market around. At the end of April he visited China and was received by government leaders; soon afterward he said in an internal letter that Tesla had obtained permission to test some of its assisted-driving features in China.

FSD v12, the autonomous driving system that began rolling out at scale this year, does show unusual potential. Owner feedback converges on one impression: "it drives just like a human." Compared with the previous generation, it rides more comfortably and will overtake on narrow roads.

Tesla FSD v12 gracefully handles complex road conditions. Photo from X user @Rebellionair3

After trying FSD v12 in the US in March this year, Zhou Guang, CEO of autonomous driving company DeepRoute.ai (Yuanrong Qixing), admitted that he had still underestimated it: "Before I went I thought it might be an 80; after the test ride I'd give it a 90."

After trying it, the head of a leading domestic new-energy-vehicle company came away believing Tesla's autonomous driving is headed for a revolutionary breakthrough. Competitors do not dare fall behind. Around the end of April, Xpeng, Huawei, Great Wall, SenseTime's Jueying (SenseAuto) and others announced plans to launch autonomous driving systems similar to FSD v12. In the same period, SoftBank, Nvidia, and Microsoft invested $1.08 billion in Wayve, a British autonomous driving company pursuing the same technical route as Tesla.

Following Tesla's route, a new autonomous driving race is on. This time it is not just about solving technical problems; it is also a race for resources. On the day he arrived in China, Musk set the entry threshold on social media: any company that does not put roughly $10 billion into computing power, he said, cannot compete in this round.

Principle: Cut 300,000 lines of code and let data determine how to drive a car

In the 2000s, DARPA hosted three driverless-vehicle challenges, starting in the desert, which became the origin of modern autonomous driving technology. Google recruited the winners, who settled on a workable approach: split autonomous driving into multiple modules:

Sensors such as lidar and cameras collect data about the vehicle's surroundings and hand it to models trained on manually labeled data, which identify important targets and obstacles (the perception module); high-precision maps then tell the system how the road ahead will change; finally, rules hand-written by engineers in code decide how to drive the car (the prediction and planning modules).

Initially, Tesla also followed the path Google had pioneered. To cut costs and expand coverage quickly, it built a solution that relied on cameras rather than expensive lidars and high-precision maps. Before v12, Tesla's autonomous driving workflow was roughly as follows:

  • The visual perception module runs first: it processes road data captured by cameras and other sensors and identifies what is on the road, how it is distributed, what is moving and what is static, where the lane lines are, and which areas are drivable.

  • Then the prediction, planning, and control modules take the perception output, predict how dynamic targets such as pedestrians and cars will move over the next few seconds, combine models with rules engineers wrote in advance to plan a safe route, and finally actuate the steering wheel, accelerator, or brake to follow it (a sketch of this division of labor follows below).
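To make that division of labor concrete, here is a minimal sketch of such a modular pipeline in Python. Every function and field name is invented for illustration; a real stack is vastly larger, and the planning stage alone is where Tesla's hand-written rules lived.

```python
# Illustrative skeleton of a classic modular self-driving stack.
# All names and numbers are hypothetical; only the hand-off between stages matters.

def perceive(camera_frames):
    """Perception: turn raw pixels into a short list of labeled objects."""
    return [{"kind": "car", "position": (12.0, -1.5), "velocity": (8.0, 0.0)}]

def predict(detections):
    """Prediction: extrapolate each dynamic object a few seconds ahead."""
    horizon = 3.0  # seconds
    return [
        {**d, "future_position": (d["position"][0] + d["velocity"][0] * horizon,
                                  d["position"][1] + d["velocity"][1] * horizon)}
        for d in detections
    ]

def plan(predictions, hd_map):
    """Planning: hand-written rules decide the trajectory (the 300,000 lines lived here)."""
    if any(abs(p["future_position"][1]) < 1.0 for p in predictions):
        return {"speed": 0.0, "steering": 0.0}   # rule: stop if a path conflict is predicted
    return {"speed": 15.0, "steering": 0.0}      # rule: otherwise keep the lane and cruise

def drive_one_step(camera_frames, hd_map):
    # Information flows one way; each stage only sees what the previous stage passed on.
    return plan(predict(perceive(camera_frames)), hd_map)
```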

Tesla unveiled the FSD system architecture at AI Day 2021. Image from Tesla

To cover as many road situations as possible, hundreds of Tesla engineers wrote 300,000 lines of C++ rules, roughly 1.7 times the amount of code in the early Linux operating system.

This is not how people learn to drive. A human does not need to pre-catalog every object that might appear on a road, nor set rules in advance for every complicated scene before getting behind the wheel.

It is hard to guarantee safety with a system built this way. The real world keeps throwing up new situations, and no number of engineers can enumerate them all. Today's commercial robotaxis can only operate in limited areas, and while there is no safety operator in the car, the operator has simply been moved to the cloud to monitor remotely.

As late as 2021, driverless cars from Google's autonomous driving subsidiary Waymo could still stop and refuse to move when they encountered a row of traffic cones on the road. By that point, Google and the rest of the industry had invested hundreds of billions of dollars. In those two years, a string of companies shut down driverless projects that had already cost billions.

"With 20% of the effort you can get 80% of the capability." Liu Langechuan, former head of autonomous driving AI at Xpeng and now on Nvidia's intelligent-vehicle team, said at an academic event last year that traditional autonomous driving solutions are easy to stand up but hard to keep improving.

Tesla's FSD v12 learns to drive more the way a human does. The biggest change is the "end-to-end" architecture: data from cameras and other sensors goes in at one end, and instructions for how to drive the car come out the other.

To train this system, the machine learned how to drive from a large volume of driving videos, paired with records of how human drivers turned the steering wheel and worked the pedals in different environments.
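The simplest mental model of this kind of training is behavior cloning: one network maps camera pixels straight to control commands and is optimized to imitate recorded human inputs. The PyTorch sketch below is a toy under that assumption; the architecture, tensor sizes, and loss are placeholders, not Tesla's.

```python
import torch
import torch.nn as nn

class TinyEndToEndDriver(nn.Module):
    """Toy end-to-end policy: camera pixels in, (steering, throttle, brake) out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                # stand-in for a real vision backbone
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, frames):                       # frames: (batch, 3, H, W)
        return self.head(self.encoder(frames))       # (batch, 3) control command

model = TinyEndToEndDriver()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Behavior cloning: each sample pairs a camera frame with the human driver's
# recorded controls at that instant. Random tensors stand in for real clips.
frames = torch.rand(8, 3, 120, 160)
human_controls = torch.rand(8, 3)

for step in range(100):
    pred = model(frames)
    loss = nn.functional.mse_loss(pred, human_controls)   # imitate the human
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```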

In FSD v12, almost all of the rules written by Tesla engineers were removed: of the 300,000 lines of rule code, only some 2,000 remain, less than 1% of the original.

An end-to-end autonomous driving system learns to drive only somewhat like a human; no system yet truly understands the world the way a person does. A human can drive safely after a few days of practice, whereas FSD has to learn from an enormous amount of video. Musk spoke about how much the data matters at an earnings call last year: "Training with 1 million video cases is barely enough; 2 million is slightly better; 3 million makes you feel wow; at 10 million it becomes incredible."

"A traditional autonomous driving system is like a funnel; information is lost layer by layer." One autonomous driving algorithm engineer explained that in the perception stage of a traditional solution, engineers usually set a "whitelist" so the system focuses on important targets such as pedestrians, vehicles, lane lines, and traffic lights, to save computing power. In the prediction and planning stage, engineers again decide in advance which perception outputs to call on, and information is lost once more. As a result, traditional solutions struggle to use enough information to decide how to drive the way humans do, and instead rely on rules engineers wrote ahead of time.

In an end-to-end solution, everything the cameras and other sensors capture flows through to decision-making. "Information is passed on without loss. The model can draw on more of the perception data when making decisions, improving its ability to handle complex scenarios," the engineer quoted above said. And because the architecture is trained end to end, the model's decisions also shape the perception process itself, letting it pick up signals that people are not even aware of but that are useful for driving.

In many scenarios, Tesla FSD v12 is clearly better. An autonomous driving practitioner (Zhihu user @EatElephant) told us that compared with v11, v12's control of speed and steering felt "very smooth": "even sitting in the back row, I barely felt any jerkiness when it turned at intersections." Traditional autonomous driving solutions, to stay safe, tap the brakes from time to time while driving.

In an article he wrote that, faced with a cyclist ahead on the right, "v11 would be overly cautious and plan an outrageously wide detour. v12 stays calm; its detour is close to what a human driver would choose, and its speed control and decisiveness are also very reasonable."

FSD v12 has also made clear progress in scenarios that are hard to describe with rules. He gave an example: when it came upon an Amazon delivery truck stopped on the roadside with its hazard lights flashing, v12 quickly judged that no cars were coming the other way and went around it immediately. Traditional solutions usually stop, or wait a while before deciding what to do.

After the FSD v12.3 update was released, a group of owners uploaded YouTube videos of their cars calmly handling all kinds of complicated road conditions, such as driving through crowded Fifth Avenue in New York at night for 30 minutes without the driver touching the steering wheel.

Faced with these enthusiastic owners, the US National Highway Traffic Safety Administration (NHTSA) sent Tesla a letter on May 6 asking it to explain in detail how it prevents owners from abusing the driver-assistance system, for example how it reminds drivers to keep their hands on the steering wheel.

Foundation: In the most difficult years, still insisting on pre-installing hardware, developing chips, and collecting data

At the beginning of 2018, when Tesla was mired in a production capacity crisis and faced a test of life and death, Musk sent an email to OpenAI management hoping OpenAI would be merged into Tesla to jointly develop a “fully autonomous driving solution based on large-scale neural network training.”

He believed that AI research and development required enormous capital, and that OpenAI needed a profit model to compete with the giants. Tesla, meanwhile, had already used the Model 3 and its supply chain to build the "first stage" of the rocket; if OpenAI were folded into Tesla, it would accelerate driverless R&D and become the "second stage." Tesla would sell more cars as a result, and OpenAI would have enough revenue to pursue artificial intelligence research.

Musk's proposal was rejected, and he eventually left the OpenAI board. But before that he had already poached Andrej Karpathy from OpenAI to lead autonomous driving R&D at Tesla, heading a team that trained more effective models.

Many autonomous driving practitioners see Karpathy's arrival at Tesla as the starting point of its development of the end-to-end autonomous driving model that became v12.

Born in 1986, Karpathy both lived through the past decade's wave of artificial intelligence and grew up as an AI scientist within it. Starting his PhD at Stanford in 2011, he worked with his advisor Fei-Fei Li on ImageNet, the competition dataset that gave birth to AlexNet, published computer vision papers at major academic conferences, and taught Stanford's first deep learning course. After finishing his PhD, he was among the first people to join OpenAI.

In November 2017, Karpathy published the well-known essay "Software 2.0," arguing that just as software is eating the world, AI-based Software 2.0 is eating software. At the time, computer vision models trained on large amounts of data could already identify objects more accurately than the human eye, and AlphaGo had learned from data how to beat human Go champions.

He argued that with massive amounts of data, artificial intelligence in most valuable verticals "is better than any code you can think of, at least in fields involving images/video and sound/speech."

Before Karpathy arrived, Tesla had already begun building the data infrastructure for autonomous driving.

Training stronger models with large amounts of data was the ideal technical route for Tesla. It has poured resources into autonomous driving to follow it, and Musk has never lacked the nerve to take risks.

Beginning in 2016, every Tesla shipped with the hardware needed to run the Autopilot driver-assistance system; owners paid separately for the software to activate it. Even now, few car brands do this. The more common practice is to split the same car into different trims and sell the versions fitted with autonomous driving hardware only to customers who want them.

With the hardware standard on every car, Tesla enabled "Shadow Mode." Even if a driver has not purchased Autopilot, the system runs in the background, recording driving data and planning routes. Musk said in an interview at the time that its role was to prove the system is more reliable than humans and to provide data to support regulators in approving the technology.

After Karpathy joined, shadow mode became Tesla's core source of model training data: whenever the route the system would have taken clearly deviates from what the driver actually did, a data-return mechanism is triggered. The car automatically records the camera footage, vehicle driving data, and so on, and uploads it to Tesla's servers once it connects to WiFi. By the end of 2018, Tesla had collected 1.6 billion kilometers of driving data this way, more than the vast majority of car companies now developing autonomous driving.
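Based only on this public description, the trigger logic of shadow mode can be imagined as a small comparison loop: plan silently, measure the divergence from the human, and queue a clip for upload when the gap is large. The threshold and data fields below are invented for illustration.

```python
import math

DIVERGENCE_THRESHOLD_M = 1.5   # hypothetical: how far apart the two paths may drift

def trajectory_divergence(planned, driven):
    """Max pointwise distance between the system's plan and the human's actual path."""
    return max(math.dist(p, d) for p, d in zip(planned, driven))

def shadow_mode_step(camera_clip, planned_trajectory, driven_trajectory, upload_queue):
    # The assistance system runs silently even when the driver has not bought it.
    if trajectory_divergence(planned_trajectory, driven_trajectory) > DIVERGENCE_THRESHOLD_M:
        # Interesting disagreement: keep the clip and upload it later over WiFi.
        upload_queue.append({
            "clip": camera_clip,
            "planned": planned_trajectory,
            "driven": driven_trajectory,
        })

# Usage: two nearly identical paths produce no upload; a swerve the planner missed does.
queue = []
planned = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
driven  = [(0.0, 0.0), (5.0, 1.0), (10.0, 3.0)]   # the driver pulled left to avoid something
shadow_mode_step("clip_0421.mp4", planned, driven, queue)
print(len(queue))  # -> 1
```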

Tesla's autonomous driving team put most of its energy into data, building a processing system dedicated to analyzing and filtering what was collected. Labeling was done by people at first; later most of it was done by machines, and the labeled data was fed back into the models to keep improving the system. To train models on this much data, Tesla began buying large numbers of GPUs before 2019 to build a computing center known as Dojo, and has kept expanding it; by now it has accumulated computing power equivalent to about 35,000 Nvidia H100s.

In April 2019, Tesla released its HW 3.0 hardware, carrying two first-generation FSD chips with a combined 144 TOPS of computing power, nearly 7 times that of Nvidia's automotive chip Xavier at the time. As before, Tesla installed this hardware in every car regardless of whether the owner had bought the assisted-driving software, and upgraded existing owners who had bought it for free.

"Not only does it allow us to run our current neural networks faster; more importantly, it allows us to deploy larger, more computationally expensive models in vehicles," Karpathy said. HW 3.0 is also the foundation that lets Tesla roll out FSD v12 at scale today.

Tesla built this infrastructure during what was also its most cash-strapped period since it began mass-producing vehicles. From 2017 to early 2019, Tesla was mired in Model 3 production problems.

By March 2019, Tesla had only $2.2 billion in cash, enough to last less than half a year. The Musk biography records him telling his wife at the time, "We have to raise money or it's over."

After mulling it over for a few nights, Musk decided to hold an event for investors, Tesla's "Autonomy Day." He told Wall Street that driverless cars would bring Tesla huge profits, and that within the next year 1 million robotaxis would be on the road, reshaping people's daily lives.

Few believed Tesla's driverless cars would arrive any time soon; in the month or so after the event, Tesla's stock fell 30%. Thanks to the smooth ramp-up of Model 3 production and the rapid completion of the Shanghai factory, Tesla pulled through. And the five years that followed turned out to be the period when Tesla's foundational autonomous driving technology advanced fastest.

Implementation: Start by simulating the human eye and expand step by step to the entire system

Learning to drive by watching video sounds simple, but countless problems had to be solved along the way.

From 2020 to 2022, Tesla released a new version of its "perception" model every year, each one a step closer to simulating the "human eye."

In February 2020, at an academic conference, Karpathy presented Tesla's multi-task model HydraNet, which trains 48 neural networks and can identify more than 1,000 kinds of targets, such as cars, bicycles, lane lines, and school zones.

Using ResNet, the model released by Microsoft Research Asia in 2015, as its backbone, HydraNet extracts shared features from the images captured by the 8 cameras around the car body and hands them to different branches to complete different tasks. This avoids having separate models repeatedly extract features from the same image, saving computing power.
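The shared-backbone, multi-head pattern described here is a standard multi-task design. A toy PyTorch version, nowhere near Tesla's real network in scale or detail, looks like this:

```python
import torch
import torch.nn as nn

class ToyHydraNet(nn.Module):
    """Shared feature extractor feeding several task-specific heads."""
    def __init__(self, num_object_classes=10):
        super().__init__()
        # Stand-in for the ResNet backbone: one set of features, computed once.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Each "hydra head" reuses the same features for a different task.
        self.object_head = nn.Linear(64, num_object_classes)   # what is in view
        self.lane_head = nn.Linear(64, 4)                      # coarse lane-line parameters
        self.drivable_head = nn.Linear(64, 1)                  # is the area drivable

    def forward(self, image):
        features = self.backbone(image)        # extracted once, shared by all heads
        return {
            "objects": self.object_head(features),
            "lanes": self.lane_head(features),
            "drivable": torch.sigmoid(self.drivable_head(features)),
        }

outputs = ToyHydraNet()(torch.rand(1, 3, 128, 192))
print({k: v.shape for k, v in outputs.items()})
```

The point of the design is that the expensive feature extraction runs once per image, while each lightweight head reuses the result.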

This was the approach taken by academia and most companies building large-scale recognition systems at the time; Tesla scaled it up and engineered it. But it has limits. HydraNet can only extract information from each camera's image separately, and any single camera may see only part of a surrounding object. Just as a novice driver struggles to reverse smoothly into a parking space using only the mirrors, a system built this way struggles to reach true driverless operation, and still needs various radars and high-precision maps to help.

Karpathy's team, which did not use lidar, chose a series of algorithms to stitch the images from the 8 cameras pointing in different directions into a 360° bird's-eye view (BEV), then let the model "understand the world" and plan routes on top of it. But for this to work well, the ground has to be as flat as possible and the car's surroundings simple; otherwise the system struggles to work out how the images from different cameras relate to each other.

"When we tried to use it for FSD, we quickly found it didn't work as expected," Karpathy said at Tesla AI Day 2021, where he introduced a new model built on the Transformer architecture that can stitch together targets spanning multiple cameras more accurately and stably.
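One common way to build such a Transformer step, and a plausible reading of what Tesla showed, is cross-attention: a learned query for each cell of the bird's-eye-view grid attends to image features from all eight cameras at once, so the network itself learns how the views fit together. The sketch below uses invented sizes and is not Tesla's implementation.

```python
import torch
import torch.nn as nn

class ToyBEVFusion(nn.Module):
    """Fuse per-camera features into a bird's-eye-view grid with cross-attention."""
    def __init__(self, feat_dim=64, bev_cells=20 * 20):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_cells, feat_dim))  # one query per BEV cell
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, camera_features):
        # camera_features: (batch, num_cameras * tokens_per_camera, feat_dim),
        # i.e. image features from all 8 cameras flattened into one token sequence.
        batch = camera_features.shape[0]
        queries = self.bev_queries.unsqueeze(0).expand(batch, -1, -1)
        bev, _ = self.cross_attn(queries, camera_features, camera_features)
        return bev                               # (batch, bev_cells, feat_dim)

fused = ToyBEVFusion()(torch.rand(2, 8 * 48, 64))
print(fused.shape)  # torch.Size([2, 400, 64])
```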

The top three views are images from Tesla's on-board cameras; bottom left is the BEV map stitched by the traditional method, bottom right by the Transformer method

Moreover, the information output by the Transformer-based model can be used directly by the downstream prediction and planning modules, which laid the groundwork for turning FSD v12 into an end-to-end model.

Alongside the new model, Karpathy also showed an architecture called the "Spatial RNN." Trained on video, the model gains a short-term "memory" of how the surrounding scene changes over time, which lets it compensate for blind spots in the cameras' field of view and build a local map in real time.
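Tesla has not published the Spatial RNN's internals, but the idea of a recurrent memory laid out over the BEV grid can be approximated with a convolutional, GRU-style update applied to each cell every frame, as in this toy sketch:

```python
import torch
import torch.nn as nn

class ToySpatialMemory(nn.Module):
    """GRU-style update applied at every cell of a BEV feature grid."""
    def __init__(self, channels=32):
        super().__init__()
        # Gates computed with convolutions so neighbouring cells can share evidence.
        self.update_gate = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.candidate = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, bev_frame, hidden):
        # bev_frame, hidden: (batch, channels, H, W)
        z = torch.sigmoid(self.update_gate(torch.cat([bev_frame, hidden], dim=1)))
        h_new = torch.tanh(self.candidate(torch.cat([bev_frame, hidden], dim=1)))
        return (1 - z) * hidden + z * h_new      # blend old memory with the new frame

memory = ToySpatialMemory()
hidden = torch.zeros(1, 32, 20, 20)
for _ in range(5):                               # feed a short video clip, frame by frame
    hidden = memory(torch.rand(1, 32, 20, 20), hidden)
print(hidden.shape)
```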

This technological iteration allows Tesla's assisted driving system to drive a car well without a high-precision map, once again pushing the upper limit of autonomous driving capabilities and getting closer to the human eye.

By the time Tesla held AI Day 2022, Karpathy had already left the company, but the autonomous driving system kept iterating. His successor, Ashok Elluswamy, introduced the Occupancy Network, which adds a "height" dimension on top of the Transformer architecture: it reconstructs the images from cameras at different angles into a 3D scene, works out how much space an object occupies, and from that infers its shape.

With the Occupancy Network, Tesla's system can recognize obstacles it has never seen before, using only cameras and no lidar, which is widely seen as a victory for the "pure vision" approach.
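The key shift is in what the network outputs: instead of labeled boxes for known classes, it predicts, for every small cube (voxel) of space around the car, the probability that something solid occupies it, so unfamiliar objects still register as obstacles. A toy output head under that assumption, with made-up grid sizes:

```python
import torch
import torch.nn as nn

class ToyOccupancyHead(nn.Module):
    """Map fused camera features to a 3D voxel grid of occupancy probabilities."""
    def __init__(self, feat_dim=64, grid=(16, 100, 100)):   # (height, length, width) voxels
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(feat_dim, grid[0] * grid[1] * grid[2])

    def forward(self, fused_features):
        # fused_features: (batch, feat_dim) pooled multi-camera features.
        logits = self.proj(fused_features).view(-1, *self.grid)
        return torch.sigmoid(logits)   # P(occupied) per voxel; no class label needed

occupancy = ToyOccupancyHead()(torch.rand(2, 64))
print(occupancy.shape)                 # torch.Size([2, 16, 100, 100])
# Anything with high occupancy in the planned path counts as an obstacle,
# whether or not the training set ever contained that kind of object.
```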

After years of R&D, Tesla had finally met the requirement Musk laid down long ago: humans can perceive and reconstruct a 3D environment with two eyes, so a car's cameras should be able to as well.

Tesla's Occupancy Network identifies obstacles around the vehicle. Image from Tesla's 2022 AI Day

Along the way, Tesla was also gradually letting neural networks decide how to drive the car. At AI Day 2021 it showed a "neural network planner" trained on large amounts of data; at the time it served only as an aid, providing a reference for the final planning and decision module. By v12, the neural network had formally taken over prediction and planning, completing the end-to-end puzzle.

Question: Does autonomous driving now have its own scaling laws?

FSD v12 is still far from true driverless operation. Like ChatGPT, it has moments of brilliance, but mistakes are common. Even after the much-praised v12.3 release, cars made basic errors such as hitting curbs and damaging wheels, the kind of mistake that rarely happened with the previous generation of solutions.

Tesla itself does not yet dare to rely fully on v12. One owner found, from the FSD software package, that v12 only runs on city streets, while v11 is still used on highways.

"The floor of an end-to-end system is actually very low," one autonomous driving engineer said. Highway driving is faster but the rules are simpler, so the traditional stack, refined over many years, may still be safer there than today's end-to-end systems. "Only when the end-to-end approach raises its floor and handles simple scenarios better than the old stack will it really count as an improvement."

End-to-end approaches need more investment to match the results of traditional solutions. Slide shared by Liu Langechuan, former head of autonomous driving AI at Xpeng, at last year's CVPR

"End-to-end models will definitely have 'guardrails' before they go live. It may grow into a PhD one day, but while it grows up it still needs its elementary and middle school teachers, and that takes time." Wu Xinzhou, head of Nvidia's automotive business, believes that before end-to-end models become mainstream they will have to work alongside the old stack to guarantee safety.
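One plausible reading of these "guardrails", consistent with v11 still handling highways, is a supervisory wrapper: the end-to-end model proposes, and simple hard checks or the older rule-based stack can veto or take over. The sketch below is a hypothetical arrangement with invented thresholds, not a description of Tesla's actual hand-off logic.

```python
HIGHWAY_SPEED_MPS = 25.0     # hypothetical cutoff for "highway" scenarios
MAX_LATERAL_ACCEL = 3.0      # hypothetical comfort/safety bound, m/s^2

def legacy_planner(scene):
    """Stand-in for the older, rule-based v11-style stack."""
    return {"speed": scene["speed_limit"], "steering": 0.0}

def end_to_end_planner(scene):
    """Stand-in for the learned end-to-end policy."""
    return scene["model_proposal"]

def guarded_drive(scene):
    # Guardrail 1: fall back to the refined legacy stack in simple high-speed scenes.
    if scene["ego_speed"] > HIGHWAY_SPEED_MPS:
        return legacy_planner(scene)
    proposal = end_to_end_planner(scene)
    # Guardrail 2: veto proposals that violate a hard physical/safety limit
    # (lateral acceleration approximated as curvature * speed^2).
    if abs(proposal["steering"]) * scene["ego_speed"] ** 2 > MAX_LATERAL_ACCEL:
        return legacy_planner(scene)
    return proposal

# Usage: a city scene where the model's proposal passes both checks.
scene = {"ego_speed": 10.0, "speed_limit": 13.0,
         "model_proposal": {"speed": 9.0, "steering": 0.02}}
print(guarded_drive(scene))
```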

Musk wants to move faster. At the first-quarter earnings call in April, he said Tesla can already see the results of the model coming in three or four months, which could be called FSD v13: "It is better than what's in cars now, but there are still some problems to solve."

He believes Tesla has found "scaling laws" that apply to autonomous driving: keep expanding model parameters, pouring in more data and computing power, and refining the architecture, and the results will keep getting better.
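For reference, the scaling laws published for language models by OpenAI and others take a power-law form; Musk's claim is essentially that driving performance follows a similar curve. Tesla has not published such a formula, so the expression below is only the language-model analogy:

```latex
L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_{\infty}
```

Here L is the loss or error rate, N the number of model parameters, D the amount of training data, and N_c, D_c, alpha_N, alpha_D, L_infinity are constants fitted empirically; the open question is whether driving data obeys anything this regular.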

For years, scaling laws have been seen as the secret behind OpenAI's push for ever larger, more capable models. But in computer vision, the field autonomous driving sits in, the training data consists of video of the physical world and the models must grasp more physical rules; many researchers worry that training ever larger models with more data and compute will hit a bottleneck, and that capability will stop improving or even decline.

"We can estimate future progress from past trends, and judging by past data the estimates are usually right," Elluswamy said on the earnings call. Every week Tesla trains hundreds of models that generate different driving routes, then tests them against millions of video clips collected from users and testers. The better ones go to dedicated road-test teams and employees, and finally out to more users, with the iteration loop getting faster and faster.

We understand that Tesla's v12 system cannot yet, like GPT-4, handle problems absent from its training data; it still has to learn how to deal with complex scenarios from large amounts of data.

As the model's capability rises, it takes ever more data to improve it further. Musk has said that out of every 10,000 kilometers of driving data, only about 1 kilometer is actually useful for training, and every training run consumes a great deal of computing power.

This is not a problem for Tesla. The millions of Teslas on the road feed it all kinds of data continuously, and the company is also building a more powerful simulation system to generate training data. At last year's CVPR computer vision conference, Elluswamy presented the "World Model" Tesla trains on collected data: given a prompt and past video, it can generate video of the scenes the car would encounter next, such as how lane lines continue and how an intersection changes as seen from cameras at different angles.

But an autonomous driving system built on an end-to-end architecture is a "black box"; even its creators cannot fully explain how it turns a pile of data into a decision. What people can do is curate the data for it, let the algorithm distill its own rules, and apply them to new data. When something goes wrong, the remedy is more data, so it can correct itself.

This is not unique to autonomous driving; every deep-learning application is the same. It's just that nobody minds much when Douyin's algorithm pushes a few boring videos, and people can put up with ChatGPT occasionally talking nonsense; they care a great deal about why a 2-ton car misbehaves on the road.

"It can fail silently, and when a problem surfaces it is often hard to analyze and debug, because the model has grown so large," Karpathy wrote of Software 2.0's flaws in that same essay. It comes down to a choice: "use a method we understand that works 90% of the time, or a model we don't understand that works 99% of the time."

Tesla has made its choice through its actions: it believes a pure-vision, end-to-end neural network trained on billions of kilometers of real-world data is the right way to achieve driverless operation at scale.

Musk's order to the autonomous driving team is to do everything possible to increase the distance FSD v12 can drive without human intervention. The team keeps a gong in the office and strikes it every time a problem is solved. Musk believes that as long as clear data shows autonomous driving to be more reliable than human driving, regulation will not be much of a barrier.

Over the past few months, Tesla has cut FSD prices, given American owners a free trial, and pushed v12 to market aggressively, logging 500 million kilometers of driving in a single quarter.

Since Tesla began developing driver-assistance systems, Musk has been extremely optimistic about driverless cars. In 2016, when Tesla first placed 8 cameras around the vehicle with a 360° view, Musk arranged for the team to carefully prepare a video to promote the imminent arrival of driverless driving.

Since then, every year or two, Musk has updated his timeline for driverless cars, and each one has proved overly optimistic. But each time, the technology has moved one step further.

Editor/Jeffrey


