Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policies by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between the VLM's high-level decisions and the E2E policy's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, weakening the system's top-down guidance and decision-following ability.
To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to the E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3D Gaussian Splatting (3DGS) environments to reinforce safety and efficiency.
With consistency-oriented training, Senna-2 addresses the consistency gap and produces more distinct and decision-aligned speed distributions, reflecting improved dual-system consistency and decision-following ability. Extensive experiments further demonstrate that Senna-2 significantly enhances driving safety in both open-loop and closed-loop settings.
*Relative speed ratio: the ratio of the planned speed at the 3-second mark to the initial speed, reflecting how the planned trajectory tends to change speed.
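As a minimal sketch of the relative speed ratio defined above, the following assumes the planned trajectory is available as 2D waypoints sampled at a fixed interval. The function name, the 0.5 s sampling interval, the finite-difference speed estimate, and the handling of a stationary start are all illustrative assumptions, not details taken from Senna-2.

```python
import math


def relative_speed_ratio(waypoints, initial_speed, dt=0.5, horizon_s=3.0, eps=1e-3):
    """Ratio of the planned speed at `horizon_s` to the initial speed.

    waypoints: list of (x, y) planned positions in meters, where
        waypoints[k] is the position at time (k + 1) * dt (assumed layout).
    initial_speed: ego speed in m/s at planning time.
    Returns None when the initial speed is below `eps`, since the
    ratio is undefined at standstill (an assumption of this sketch).
    """
    idx = round(horizon_s / dt) - 1  # index of the waypoint at horizon_s
    if idx < 1 or idx >= len(waypoints):
        raise ValueError("trajectory does not cover the requested horizon")

    # Estimate the planned speed from the last segment before horizon_s.
    (x_prev, y_prev), (x_end, y_end) = waypoints[idx - 1], waypoints[idx]
    planned_speed = math.hypot(x_end - x_prev, y_end - y_prev) / dt

    if initial_speed < eps:
        return None
    return planned_speed / initial_speed


# Constant 5 m/s along x: one waypoint every 2.5 m.
cruise = [(2.5 * (k + 1), 0.0) for k in range(6)]
# Same start, but the final segment covers only 1.25 m (2.5 m/s).
slowing = [(2.5, 0.0), (5.0, 0.0), (7.5, 0.0), (10.0, 0.0), (12.5, 0.0), (13.75, 0.0)]

print(relative_speed_ratio(cruise, 5.0))
print(relative_speed_ratio(slowing, 5.0))
```

Under this reading, a ratio above 1 indicates a plan that speeds up over the 3-second horizon, a ratio below 1 indicates one that slows down, and a ratio near 1 indicates roughly constant speed.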
Left: Senna
Right: Senna-2
Clip 1: Following and Stop
Senna: Collision with the front vehicle.
Senna-2: Success.
Clip 2: Ego Cut-in
Senna: Collision with the bus.
Senna-2: Success.
Clip 3: Cut-in
Senna: Collision with another vehicle.
Senna-2: Success.
Clip 4: Right Turn; Yield to Cyclists
Senna: Collision with the cyclist.
Senna-2: Success.
Clip 5: Lane Change
Senna: Straddling lanes.
Senna-2: Success.
Clip 6: Aborted Lane Change
Senna: Collision with the rear vehicle.
Senna-2: Success.
Senna-2 is built upon several previous works:
Senna introduces the initial framework for bridging VLMs and E2E driving policies, which we extend and improve in this work.
The QwenVL series provides a high-performance vision-language model that serves as the backbone for high-level decision-making.
The VAD series, DiffusionDrive, and ResAD offer robust E2E driving policy frameworks that we build upon for low-level planning.
RAD provides a strong training framework for closed-loop driving policy optimization.
@misc{song2026senna2aligningvlmendtoend,
title={Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning},
author={Yuehao Song and Shaoyu Chen and Hao Gao and Yifan Zhu and Weixiang Yue and Jialv Zou and Bo Jiang and Zihao Lu and Yu Wang and Qian Zhang and Xinggang Wang},
year={2026},
eprint={2603.11219},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.11219},
}