Self-Supervised Bird’s Eye View Motion Prediction with Cross-Modality Signals

Authors: Shaoheng Fang, Zuhong Liu, Mingyu Wang, Chenxin Xu, Yiqi Zhong, Siheng Chen

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluations conducted on the nuScenes (Caesar et al. 2020) dataset demonstrate that our proposed methodology improves upon previous self-supervised approaches by up to 40%. Notably, our method achieves performance comparable to weakly-supervised and fully-supervised methods.
Researcher Affiliation | Academia | Shaoheng Fang (1), Zuhong Liu (1), Mingyu Wang (2), Chenxin Xu (1), Yiqi Zhong (3), Siheng Chen (1, 4). Affiliations: 1 Shanghai Jiao Tong University; 2 University of Chinese Academy of Sciences; 3 University of Southern California; 4 Shanghai AI Laboratory.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code, such as a repository link or an explicit statement about code release.
Open Datasets | Yes | We evaluate our approach on the nuScenes (Caesar et al. 2020) dataset. nuScenes contains 1000 scenes, each of which has 20 seconds of LiDAR point cloud sequences and multi-view camera videos annotated at 2 Hz.
Dataset Splits | Yes | Following the setting in previous works for fair comparisons (Wu, Chen, and Metaxas 2020; Wang et al. 2022; Luo, Yang, and Yuille 2021; Li et al. 2023; Jia et al. 2023), we adopt 500 scenes for training, 100 scenes for validation, and 250 scenes for testing.
Hardware Specification | Yes | All models are trained on four NVIDIA 3090 GPUs with a batch size of 64.
Software Dependencies | No | The paper mentions "we employ (Teed and Deng 2020) as the optical flow estimation model with the pretrained parameters offered by PyTorch," but it does not specify version numbers for PyTorch or any other software dependency (see the RAFT-loading sketch after the table).
Experiment Setup | Yes | The input point clouds are cropped within a range of [-32, 32] × [-32, 32] × [-3, 2] meters, and the BEV output map is 256 × 256 in size... The static/dynamic classification thresholds in Eq. 5 are τ_2D = 5 pixels and τ_3D = 1 m... For the training loss in Eq. 10, we set λ_mc = 1, λ_pr = 0.1, and λ_tc = 0.4. We employ the AdamW (Loshchilov and Hutter 2017) optimization algorithm for training... We train the model for 100 epochs with an initial learning rate of 0.008, and we decay the learning rate by 0.5 every 20 epochs. (A hedged configuration sketch follows the table.)
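
For readers attempting reproduction, the hyperparameters quoted in the Experiment Setup row map onto a standard PyTorch training loop. The sketch below is a minimal illustration, not the authors' implementation: the model, losses, and data are stand-ins (the paper's code is not released), and only the crop range, BEV resolution, thresholds, loss weights, optimizer, learning-rate schedule, and epoch count come from the paper.

```python
import torch
from torch import nn

# Hyperparameters quoted from the paper's Experiment Setup row.
LAMBDA_MC, LAMBDA_PR, LAMBDA_TC = 1.0, 0.1, 0.4     # Eq. 10 loss weights
TAU_2D, TAU_3D = 5.0, 1.0                           # Eq. 5 thresholds: 5 px, 1 m
                                                    # (unused in this stub; they gate
                                                    # static/dynamic labels in Eq. 5)
CROP = ((-32.0, 32.0), (-32.0, 32.0), (-3.0, 2.0))  # x/y/z crop range in meters
BEV_SIZE = 256
CELL_M = (CROP[0][1] - CROP[0][0]) / BEV_SIZE       # 0.25 m per BEV cell

# Hypothetical stand-in for the paper's BEV motion-prediction network:
# height-bin input channels -> 2-channel BEV motion field.
model = nn.Conv2d(13, 2, kernel_size=3, padding=1)

# AdamW at lr 0.008, halved every 20 epochs, 100 epochs total (as quoted).
optimizer = torch.optim.AdamW(model.parameters(), lr=0.008)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):
    for _ in range(1):  # one synthetic batch per epoch; the paper uses batch size 64
        bev = torch.randn(2, 13, BEV_SIZE, BEV_SIZE)  # placeholder BEV input
        pred = model(bev)
        # Placeholder scalars standing in for the paper's three loss terms.
        loss_mc, loss_pr, loss_tc = pred.abs().mean(), pred.pow(2).mean(), pred.std()
        loss = LAMBDA_MC * loss_mc + LAMBDA_PR * loss_pr + LAMBDA_TC * loss_tc
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Note that the 64 m × 64 m crop divided into a 256 × 256 grid implies a 0.25 m BEV cell, which is consistent with the resolution used by prior BEV motion-prediction work on nuScenes.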
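The Software Dependencies row cites (Teed and Deng 2020), i.e. RAFT, "with the pretrained parameters offered by PyTorch." That most plausibly refers to torchvision's RAFT implementation and weights; the sketch below loads them under that assumption (the model variant, frame sizes, and frame contents are illustrative, since the paper pins no versions).

```python
import torch
from torchvision.models.optical_flow import Raft_Large_Weights, raft_large

# Assumption: "pretrained parameters offered by PyTorch" means torchvision's
# RAFT weights (available since torchvision 0.12; the paper pins no version).
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

# The preset transforms normalize an image pair to RAFT's expected input.
transforms = weights.transforms()
frame_t0 = torch.rand(1, 3, 360, 640)  # placeholder frames; H and W must be
frame_t1 = torch.rand(1, 3, 360, 640)  # divisible by 8 for RAFT
frame_t0, frame_t1 = transforms(frame_t0, frame_t1)

with torch.no_grad():
    flow_iters = model(frame_t0, frame_t1)  # list of iteratively refined flows
flow = flow_iters[-1]                       # final estimate, shape (1, 2, 360, 640)
```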