Motion planning, which aims to generate safe and feasible trajectories in highly dynamic and complex environments, is a core capability for autonomous vehicles. In this paper, we propose DRAMA, the first Mamba-based end-to-end motion planner for autonomous vehicles. DRAMA fuses camera and LiDAR Bird's Eye View (BEV) images in feature space, together with ego status information, to generate a series of future ego trajectories. Unlike traditional Transformer-based methods, whose attention cost scales quadratically with sequence length, DRAMA achieves a less computationally intensive attention complexity, demonstrating the potential to handle increasingly complex scenarios. Leveraging our Mamba fusion module, DRAMA efficiently and effectively fuses the features of the camera and LiDAR modalities. In addition, we introduce a Mamba-Transformer decoder that enhances overall planning performance. This module is universally adaptable to any Transformer-based model, especially for tasks with long-sequence inputs. We further introduce a novel feature state dropout that improves the planner's robustness without increasing training or inference time. Extensive experimental results show that DRAMA achieves higher accuracy on the NAVSIM dataset than the baseline Transfuser, with fewer parameters and lower computational cost.
Pipeline.
DRAMA combines camera and LiDAR BEV images in feature space using the Mamba Fusion module. The final fused feature is concatenated with the ego status and passed to the decoder, which uses multiple Mamba-Transformer decoder layers to output a deterministic trajectory for AV navigation.
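As a shape-level illustration of this pipeline, the sketch below traces the tensor flow from backbone features to a trajectory. The dimensions, the `nn.Identity()` fusion stand-in, and the mean-pooled trajectory head are assumptions for exposition only; the actual Mamba fusion, feature state dropout, and Mamba-Transformer decoder blocks are sketched in the sections that follow.

```python
import torch
import torch.nn as nn

# Shape-level sketch of the pipeline described above (all sizes illustrative).
B, D, T = 2, 64, 8                                    # batch, feature dim, horizon
cam_feat = torch.randn(B, 16 * 16, D)                 # camera backbone tokens
lidar_feat = torch.randn(B, 32 * 32, D)               # LiDAR BEV backbone tokens

fusion = nn.Identity()                                # stands in for the Mamba fusion module
fused = fusion(torch.cat([cam_feat, lidar_feat], dim=1))

ego_status = torch.randn(B, 1, D)                     # embedded ego status as one token
decoder_in = torch.cat([fused, ego_status], dim=1)    # decoder input sequence

head = nn.Linear(D, T * 2)                            # deterministic (x, y) waypoint head
trajectory = head(decoder_in.mean(dim=1)).view(B, T, 2)
print(trajectory.shape)                               # torch.Size([2, 8, 2])
```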
Multi-scale Convolution.
Multi-scale convolution is employed to capture image features at different scales.
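A minimal PyTorch sketch of such a multi-scale block is shown below. The parallel kernel sizes (3, 5, 7) and the 1x1 fusion projection are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Capture image features at several receptive-field sizes in parallel."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per scale; same-padding keeps the spatial size fixed.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        # Fuse the per-scale responses back to the input channel count.
        self.project = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(multi_scale)

feat = torch.randn(2, 64, 32, 32)              # (batch, channels, H, W)
print(MultiScaleConv(64)(feat).shape)          # torch.Size([2, 64, 32, 32])
```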
Mamba Fusion.
The image and LiDAR BEV features are first reshaped and concatenated, then fused by the Mamba module. After processing by Mamba, the fused features are split and reshaped back to their original dimensions.
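The reshape-concatenate-fuse-split flow can be sketched as follows. `seq_mixer` stands in for the paper's Mamba block (e.g. `mamba_ssm.Mamba(d_model=D)`, or any module mapping `(B, L, D)` to `(B, L, D)`); `nn.Identity()` is used only to keep the sketch runnable without that dependency.

```python
import torch
import torch.nn as nn

class MambaFusion(nn.Module):
    """Fuse camera and LiDAR BEV feature maps as one token sequence."""

    def __init__(self, seq_mixer: nn.Module):
        super().__init__()
        self.seq_mixer = seq_mixer

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor):
        b, d, hc, wc = cam.shape
        _, _, hl, wl = lidar.shape
        # Flatten each map to a (B, H*W, D) token sequence and concatenate.
        cam_tokens = cam.flatten(2).transpose(1, 2)
        lidar_tokens = lidar.flatten(2).transpose(1, 2)
        fused = self.seq_mixer(torch.cat([cam_tokens, lidar_tokens], dim=1))
        # Split the fused sequence and restore the original spatial layouts.
        cam_out, lidar_out = fused.split([hc * wc, hl * wl], dim=1)
        cam_out = cam_out.transpose(1, 2).reshape(b, d, hc, wc)
        lidar_out = lidar_out.transpose(1, 2).reshape(b, d, hl, wl)
        return cam_out, lidar_out

fusion = MambaFusion(nn.Identity())            # Identity keeps the demo dependency-free
cam, lidar = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 32, 32)
out_cam, out_lidar = fusion(cam, lidar)
print(out_cam.shape, out_lidar.shape)
```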
Feature State Dropout.
The fusion feature and state feature are concatenated and augmented with a learnable positional embedding. The combined features are then passed through a differentiated dropout policy that selectively drops features.
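Below is one plausible reading of this module as a sketch: whole tokens are zeroed with different rates for the fusion and state groups. The token-level granularity and the per-group rates are assumptions about the "differentiated" policy, not confirmed details from the paper.

```python
import torch
import torch.nn as nn

class FeatureStateDropout(nn.Module):
    """Concatenate fusion and ego-state tokens, add a learnable positional
    embedding, then drop whole tokens with per-group rates (assumed policy)."""

    def __init__(self, num_fusion: int, num_state: int, dim: int,
                 p_fusion: float = 0.1, p_state: float = 0.5):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_fusion + num_state, dim))
        self.num_fusion, self.p_fusion, self.p_state = num_fusion, p_fusion, p_state

    def forward(self, fusion: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        x = torch.cat([fusion, state], dim=1) + self.pos_embed
        if self.training:
            b, l, _ = x.shape
            # Per-token drop probabilities: one rate for fusion tokens,
            # a different (here higher) rate for ego-state tokens.
            p = torch.full((b, l, 1), self.p_fusion, device=x.device)
            p[:, self.num_fusion:] = self.p_state
            # Zero out entire tokens so the planner learns to tolerate
            # missing inputs (no inverse-scaling, unlike standard dropout).
            x = x * torch.bernoulli(1.0 - p)
        return x

fsd = FeatureStateDropout(num_fusion=4, num_state=1, dim=64)
out = fsd(torch.randn(2, 4, 64), torch.randn(2, 1, 64))
print(out.shape)                               # torch.Size([2, 5, 64])
```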
Mamba-Transformer Decoder.
The query is processed by the Mamba module and then cross-attends to the key and value. This module provides a viable alternative to the traditional Transformer decoder, particularly for processing long sequences.
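The layer below sketches this design: a sequence mixer (Mamba in the paper) processes the query in place of decoder self-attention, followed by standard cross-attention over the fused scene features and a feed-forward block. Norm placement, FFN width, and the `seq_mixer` stand-in are assumptions.

```python
import torch
import torch.nn as nn

class MambaTransformerDecoderLayer(nn.Module):
    """Decoder layer sketch: Mamba over the query, then cross-attention."""

    def __init__(self, dim: int, num_heads: int, seq_mixer: nn.Module):
        super().__init__()
        self.seq_mixer = seq_mixer             # e.g. mamba_ssm.Mamba(d_model=dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Mamba replaces self-attention over the query sequence.
        query = query + self.seq_mixer(self.norm1(query))
        # Cross-attention: query attends to the fused features as key/value.
        attn_out, _ = self.cross_attn(self.norm2(query), memory, memory)
        query = query + attn_out
        return query + self.ffn(self.norm3(query))

layer = MambaTransformerDecoderLayer(64, 4, nn.Identity())
out = layer(torch.randn(2, 8, 64), torch.randn(2, 100, 64))
print(out.shape)                               # torch.Size([2, 8, 64])
```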
Visualizations of the planning results of DRAMA across various scenarios.
(a) Yielding to pedestrians (b) Lane changing to overtake (c) Lane changing on a curve (d) Entering the parking area
(e) Waiting at a red light (f) Exiting the parking area (g) Turning at an intersection (h) Turning while following a vehicle
@misc{yuan2024drama,
title={DRAMA: An Efficient End-to-end Motion Planner for Autonomous Driving with Mamba},
author={Chengran Yuan and Zhanqi Zhang and Jiawei Sun and Shuo Sun and Zefan Huang and Christina Dao Wen Lee and Dongen Li and Yuhang Han and Anthony Wong and Keng Peng Tee and Marcelo H. Ang Jr.},
year={2024},
eprint={2408.03601},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2408.03601},
}