Motion planning, which aims to generate safe and feasible trajectories in highly dynamic and complex environments, is a core capability for autonomous vehicles. In this paper, we propose DRAMA, the first Mamba-based end-to-end motion planner for autonomous vehicles. DRAMA fuses camera and LiDAR Bird's Eye View (BEV) images in feature space, together with ego status information, to generate a series of future ego trajectories. Unlike traditional Transformer-based methods, whose attention cost scales quadratically with sequence length, DRAMA achieves a less computationally intensive attention complexity, demonstrating the potential to handle increasingly complex scenarios. Leveraging our Mamba fusion module, DRAMA efficiently and effectively fuses the features of the camera and LiDAR modalities. In addition, we introduce a Mamba-Transformer decoder that enhances overall planning performance. This module is universally adaptable to any Transformer-based model, especially for tasks with long-sequence inputs. We further introduce a novel feature state dropout that improves the planner's robustness without increasing training or inference time. Extensive experimental results show that DRAMA achieves higher accuracy on the NAVSIM dataset than the baseline Transfuser, with fewer parameters and lower computational cost.
Pipeline.
DRAMA combines camera and LiDAR BEV images in feature space using the Mamba Fusion module. The final fused feature is concatenated with the ego status and passed to the decoder, which uses multiple Mamba-Transformer decoder layers to output a deterministic trajectory for AV navigation.
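As a shape-level illustration of this pipeline, the sketch below traces the tensor flow from backbone features to a trajectory. The dimensions, the `nn.Identity()` fusion stand-in, and the mean-pooled trajectory head are assumptions for exposition only; the actual Mamba fusion, feature state dropout, and Mamba-Transformer decoder blocks are sketched in the sections that follow.

```python
import torch
import torch.nn as nn

# Shape-level sketch of the pipeline described above (all sizes illustrative).
B, D, T = 2, 64, 8                                    # batch, feature dim, horizon
cam_feat = torch.randn(B, 16 * 16, D)                 # camera backbone tokens
lidar_feat = torch.randn(B, 32 * 32, D)               # LiDAR BEV backbone tokens

fusion = nn.Identity()                                # stands in for the Mamba fusion module
fused = fusion(torch.cat([cam_feat, lidar_feat], dim=1))

ego_status = torch.randn(B, 1, D)                     # embedded ego status as one token
decoder_in = torch.cat([fused, ego_status], dim=1)    # decoder input sequence

head = nn.Linear(D, T * 2)                            # deterministic (x, y) waypoint head
trajectory = head(decoder_in.mean(dim=1)).view(B, T, 2)
print(trajectory.shape)                               # torch.Size([2, 8, 2])
```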
Multi-scale Convolution.
Multi-scale convolution is employed to capture image features at different scales.
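A minimal PyTorch sketch of such a multi-scale block is shown below. The parallel kernel sizes (3, 5, 7) and the 1x1 fusion projection are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Capture image features at several receptive-field sizes in parallel."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per scale; same-padding keeps the spatial size fixed.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        # Fuse the per-scale responses back to the input channel count.
        self.project = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(multi_scale)

feat = torch.randn(2, 64, 32, 32)              # (batch, channels, H, W)
print(MultiScaleConv(64)(feat).shape)          # torch.Size([2, 64, 32, 32])
```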
Mamba Fusion.
The image and LiDAR BEV features are first reshaped and concatenated, then fused by the Mamba module. After processing by Mamba, the fused features are split and reshaped back to their original dimensions.
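The reshape-concatenate-fuse-split flow can be sketched as follows. `seq_mixer` stands in for the paper's Mamba block (e.g. `mamba_ssm.Mamba(d_model=D)`, or any module mapping `(B, L, D)` to `(B, L, D)`); `nn.Identity()` is used only to keep the sketch runnable without that dependency.

```python
import torch
import torch.nn as nn

class MambaFusion(nn.Module):
    """Fuse camera and LiDAR BEV feature maps as one token sequence."""

    def __init__(self, seq_mixer: nn.Module):
        super().__init__()
        self.seq_mixer = seq_mixer

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor):
        b, d, hc, wc = cam.shape
        _, _, hl, wl = lidar.shape
        # Flatten each map to a (B, H*W, D) token sequence and concatenate.
        cam_tokens = cam.flatten(2).transpose(1, 2)
        lidar_tokens = lidar.flatten(2).transpose(1, 2)
        fused = self.seq_mixer(torch.cat([cam_tokens, lidar_tokens], dim=1))
        # Split the fused sequence and restore the original spatial layouts.
        cam_out, lidar_out = fused.split([hc * wc, hl * wl], dim=1)
        cam_out = cam_out.transpose(1, 2).reshape(b, d, hc, wc)
        lidar_out = lidar_out.transpose(1, 2).reshape(b, d, hl, wl)
        return cam_out, lidar_out

fusion = MambaFusion(nn.Identity())            # Identity keeps the demo dependency-free
cam, lidar = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 32, 32)
out_cam, out_lidar = fusion(cam, lidar)
print(out_cam.shape, out_lidar.shape)
```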
Feature State Dropout.
The fusion feature and state feature are concatenated and augmented with a learnable positional embedding. The combined features are then passed through a differentiated dropout policy that selectively drops features.
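Below is one plausible reading of this module as a sketch: whole tokens are zeroed with different rates for the fusion and state groups. The token-level granularity and the per-group rates are assumptions about the "differentiated" policy, not confirmed details from the paper.

```python
import torch
import torch.nn as nn

class FeatureStateDropout(nn.Module):
    """Concatenate fusion and ego-state tokens, add a learnable positional
    embedding, then drop whole tokens with per-group rates (assumed policy)."""

    def __init__(self, num_fusion: int, num_state: int, dim: int,
                 p_fusion: float = 0.1, p_state: float = 0.5):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_fusion + num_state, dim))
        self.num_fusion, self.p_fusion, self.p_state = num_fusion, p_fusion, p_state

    def forward(self, fusion: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        x = torch.cat([fusion, state], dim=1) + self.pos_embed
        if self.training:
            b, l, _ = x.shape
            # Per-token drop probabilities: one rate for fusion tokens,
            # a different (here higher) rate for ego-state tokens.
            p = torch.full((b, l, 1), self.p_fusion, device=x.device)
            p[:, self.num_fusion:] = self.p_state
            # Zero out entire tokens so the planner learns to tolerate
            # missing inputs (no inverse-scaling, unlike standard dropout).
            x = x * torch.bernoulli(1.0 - p)
        return x

fsd = FeatureStateDropout(num_fusion=4, num_state=1, dim=64)
out = fsd(torch.randn(2, 4, 64), torch.randn(2, 1, 64))
print(out.shape)                               # torch.Size([2, 5, 64])
```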
Mamba-Transformer Decoder.
The query is processed by the Mamba module and then cross-attends to the key and value. This module provides a viable alternative to the traditional Transformer decoder, particularly for processing long sequences.
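The layer below sketches this design: a sequence mixer (Mamba in the paper) processes the query in place of decoder self-attention, followed by standard cross-attention over the fused scene features and a feed-forward block. Norm placement, FFN width, and the `seq_mixer` stand-in are assumptions.

```python
import torch
import torch.nn as nn

class MambaTransformerDecoderLayer(nn.Module):
    """Decoder layer sketch: Mamba over the query, then cross-attention."""

    def __init__(self, dim: int, num_heads: int, seq_mixer: nn.Module):
        super().__init__()
        self.seq_mixer = seq_mixer             # e.g. mamba_ssm.Mamba(d_model=dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Mamba replaces self-attention over the query sequence.
        query = query + self.seq_mixer(self.norm1(query))
        # Cross-attention: query attends to the fused features as key/value.
        attn_out, _ = self.cross_attn(self.norm2(query), memory, memory)
        query = query + attn_out
        return query + self.ffn(self.norm3(query))

layer = MambaTransformerDecoderLayer(64, 4, nn.Identity())
out = layer(torch.randn(2, 8, 64), torch.randn(2, 100, 64))
print(out.shape)                               # torch.Size([2, 8, 64])
```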
Visualizations of the planning results of DRAMA across various scenarios.
(a) Yielding to pedestrians (b) Lane changing to overtake (c) Lane changing on a curve (d) Entering the parking area
(e) Waiting at a red light (f) Exiting the parking area (g) Turning at an intersection (h) Turning while following a vehicle
@misc{yuan2024drama,
title={DRAMA: An Efficient End-to-end Motion Planner for Autonomous Driving with Mamba},
author={Chengran Yuan and Zhanqi Zhang and Jiawei Sun and Shuo Sun and Zefan Huang and Christina Dao Wen Lee and Dongen Li and Yuhang Han and Anthony Wong and Keng Peng Tee and Marcelo H. Ang Jr.},
year={2024},
eprint={2408.03601},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2408.03601},
}