Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception

Authors: Xiao Wang, Wentao Wu, Chenglong Li, Zhicheng Zhao, Zhe Chen, Yukai Shi, Jin Tang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on four vehicle-based downstream tasks fully validated the effectiveness of our VehicleMAE."
Researcher Affiliation | Academia | (1) Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei 230601, China; (2) Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, Hefei 230601, China; (3) School of Computer Science and Technology, Anhui University, Hefei 230601, China; (4) School of Artificial Intelligence, Anhui University, Hefei 230601, China; (5) School of Computing, Engineering and Mathematical Sciences, La Trobe University; (6) School of Information Engineering, Guangdong University of Technology, Guangzhou, China
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "The source code and pretrained models will be released at https://github.com/Event-AHU/VehicleMAE."
Open Datasets | Yes | "It contains 1,026,394 vehicle images from diverse scenarios and sources, including the existing vehicle datasets CompCars (Yang et al. 2015) and VERI-Wild (Lou et al. 2019). There are 732,112 surveillance images and 294,282 network images. These images fully reflect the key features of vehicles, such as illumination, motion blur, viewpoints, and occlusion. For our four downstream tasks, three datasets are adopted for validation: the VeRi dataset (Liu et al. 2016), the Stanford Cars dataset (Krause et al. 2013), and the PartImageNet dataset (He et al. 2022a)."
Dataset Splits | No | The paper mentions using datasets for "downstream validation" and states "More details about these datasets can be found in our supplementary materials", but it does not explicitly provide training/validation/test splits (percentages or sample counts) in the main text.
Hardware Specification | Yes | "A server with four RTX3090 GPUs is used for the pre-training."
Software Dependencies | No | "All the experiments are implemented using Python based on the deep learning toolkit PyTorch (Paszke et al. 2019)." No specific version numbers for Python, PyTorch, or other libraries are given.
Experiment Setup | Yes | "In our pre-training phase, the learning rate is set as 0.00025 and the weight decay is 0.04. AdamW (Loshchilov and Hutter 2018) is selected as the optimizer to train our model. The batch size is 512, and the model is trained for a total of 100 epochs on our Autobot1M dataset. The trade-off parameters between the various loss functions are set as 4, 0.02, 0.02, 2, and 0.1, respectively."
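
To make the reported hyperparameters concrete, below is a minimal sketch of the pre-training configuration, assuming standard PyTorch APIs. The backbone here is a dummy stand-in, and the five loss variables are hypothetical placeholders; the quoted passage gives only the trade-off weights, not the names or order of the individual loss terms.

```python
import torch
import torch.nn as nn

# Dummy stand-in for the VehicleMAE encoder/decoder (hypothetical placeholder).
model = nn.Linear(768, 768)

# Reported settings: AdamW optimizer, learning rate 0.00025, weight decay 0.04.
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.04)

# Reported trade-off weights for the five loss terms: 4, 0.02, 0.02, 2, 0.1.
# The mapping of weights to specific losses is assumed here, not confirmed.
LOSS_WEIGHTS = (4.0, 0.02, 0.02, 2.0, 0.1)

def combined_loss(losses):
    """Weighted sum of the five pre-training losses (order assumed)."""
    return sum(w * l for w, l in zip(LOSS_WEIGHTS, losses))

# Reported schedule: batch size 512, 100 epochs on the Autobot1M dataset,
# pre-trained on a server with four RTX3090 GPUs (training loop elided).
```

Given the four RTX3090 GPUs, the global batch size of 512 presumably corresponds to 128 images per GPU under data-parallel training, though the paper does not state the per-GPU split.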