Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning


¹Huazhong University of Science & Technology, ²Horizon Robotics

†Project Lead, ✉Corresponding Author.

Abstract

Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policies through high-level semantic reasoning. However, existing approaches often overlook dual-system consistency between the VLM's high-level decisions and the E2E policy's low-level planning. As a result, the generated trajectories may diverge from the intended driving decisions, which weakens top-down guidance and the system's ability to follow decisions.

To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to the E2E policy as implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up hierarchical reinforcement learning in 3D Gaussian Splatting (3DGS) environments to reinforce safety and efficiency.
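The three stages can be summarised as an ordered schedule. The sketch below only illustrates that ordering: `TrainingStage`, `STAGES`, and `run_schedule` are hypothetical names rather than part of any released Senna-2 code, and the field values simply restate the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TrainingStage:
    """Hypothetical record for one training stage (not from the Senna-2 codebase)."""
    name: str
    setting: Optional[str]  # None where the abstract does not state a setting
    goal: str


# Field values paraphrase the abstract; the structure itself is an assumption.
STAGES: Tuple[TrainingStage, ...] = (
    TrainingStage(
        name="driving pre-training",
        setting=None,
        goal="preliminary decision-making and planning, with a decision adapter "
             "passing VLM decisions to the E2E policy as implicit embeddings",
    ),
    TrainingStage(
        name="open-loop alignment",
        setting="open-loop",
        goal="align the VLM and the E2E policy",
    ),
    TrainingStage(
        name="closed-loop alignment",
        setting="closed-loop (3DGS environments)",
        goal="reinforce safety and efficiency via bottom-up hierarchical RL",
    ),
)


def run_schedule(
    stages: Sequence[TrainingStage],
    train_stage: Callable[[TrainingStage], None],
) -> List[str]:
    """Run the stages strictly in order and return the names that were run."""
    completed: List[str] = []
    for stage in stages:
        train_stage(stage)  # the caller supplies the actual training routine
        completed.append(stage.name)
    return completed
```

Calling `run_schedule(STAGES, my_trainer)` with any callable would visit pre-training first, then open-loop alignment, then closed-loop alignment, mirroring the order in which the paradigm is described.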

With consistency-oriented training, Senna-2 addresses the consistency gap and produces more distinct and decision-aligned speed distributions, reflecting improved dual-system consistency and decision-following ability. Extensive experiments further demonstrate that Senna-2 significantly enhances driving safety in both open-loop and closed-loop settings.

*Relative speed ratio: the ratio of the planned speed at the 3rd second to the initial speed, reflecting how the speed of the planned trajectory tends to change.
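As a minimal sketch of this definition, the function below computes the ratio from a list of planned speeds. It assumes details the page does not specify: that "the 3rd second" means t = 3 s, that planned speeds are sampled every `dt` seconds starting at t = `dt`, and that the ratio is left undefined when the vehicle is nearly stationary. The function name, parameters, and the `eps` threshold are all hypothetical.

```python
from typing import Optional, Sequence


def relative_speed_ratio(
    planned_speeds: Sequence[float],
    initial_speed: float,
    dt: float = 0.5,         # assumed sampling interval in seconds
    horizon_s: float = 3.0,  # "the 3rd second", read here as t = 3 s
    eps: float = 1e-3,       # assumed standstill threshold in m/s
) -> Optional[float]:
    """Return planned speed at t = horizon_s divided by the initial speed.

    planned_speeds[i] is taken to be the planned speed at time (i + 1) * dt.
    Returns None when the initial speed is too small for a meaningful ratio.
    """
    index = round(horizon_s / dt) - 1
    if index < 0 or index >= len(planned_speeds):
        raise ValueError("planned trajectory does not cover the requested horizon")
    if abs(initial_speed) < eps:
        return None
    return planned_speeds[index] / initial_speed


# A plan that slows from 10 m/s to 7.5 m/s over 3 s.
decelerating = [10.0, 9.5, 9.0, 8.5, 8.0, 7.5]
print(relative_speed_ratio(decelerating, initial_speed=10.0))  # 0.75
```

Under these assumptions, a ratio below 1 indicates a plan that slows down, a ratio near 1 indicates a plan that holds speed, and a ratio above 1 indicates a plan that speeds up.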

Acknowledgements

Senna-2 is built upon several previous works:

Senna introduces the initial framework for bridging VLMs and E2E driving policies, which we extend and improve in this work.

The QwenVL series provides a high-performance vision-language model that serves as the backbone for high-level decision-making.

The VAD series, DiffusionDrive, and ResAD offer robust E2E driving policy frameworks that we build upon for low-level planning.

RAD provides a strong training framework for closed-loop driving policy optimization.

Citation

@misc{song2026senna2aligningvlmendtoend,
  title={Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning},
  author={Yuehao Song and Shaoyu Chen and Hao Gao and Yifan Zhu and Weixiang Yue and Jialv Zou and Bo Jiang and Zihao Lu and Yu Wang and Qian Zhang and Xinggang Wang},
  year={2026},
  eprint={2603.11219},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.11219},
}
