Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving

Authors: Xiwen Liang, Yangxin Wu, Jianhua Han, Hang Xu, Chunjing Xu, Xiaodan Liang

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, including semantic segmentation, drivable area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. Our experiments demonstrate that the adapt stage significantly improves the overall performance of those off-the-shelf pretrained models and the contextual features generated by LV-Adapter are of general benefits for downstream tasks.
Researcher Affiliation | Collaboration | Xiwen Liang1, Yangxin Wu1, Jianhua Han2, Hang Xu2, Chunjing Xu2, Xiaodan Liang1; 1Shenzhen Campus of Sun Yat-Sen University, 2Huawei Noah's Ark Lab; {liangxw29@mail2, wuyx29@mail2, liangxd9@mail}.sysu.edu.cn, {hanjianhua4, xu.hang, xuchunjing}@huawei.com
Pseudocode | No | The paper includes figures illustrating model architectures and paradigms, but no structured pseudocode or algorithm blocks.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] Code will be released.
Open Datasets | Yes | We surprisingly find that their performances are sub-optimal or even lag far behind the single-task baseline, which may be due to the distinctions in training objectives and architectural design lying in the pretrain-finetune paradigm. To overcome this dilemma as well as avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, where the off-the-shelf pretrained models can be effectively adapted without increasing the training overhead. During the adapt stage, we utilize learnable multi-scale adapters to dynamically adjust the pretrained model weights supervised by multi-task objectives while leaving the pretrained knowledge untouched. Furthermore, we regard the vision-language pre-training model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors in the multi-task model via task-specific prompting and alignment between visual and textual features. Our experiments demonstrate that the adapt stage significantly improves the overall performance of those off-the-shelf pretrained models and the contextual features generated by LV-Adapter are of general benefits for downstream tasks. (See the adapter sketch after the table.)
Dataset Splits | Yes | We introduce three settings, i.e., the Disjoint-normal, Disjoint-balance, and Full-setting, corresponding to different levels of annotation quantity in our experiments. For clarity, we defer the detailed descriptions of the three settings to Section 5.1. For consistency, we conduct evaluations on the same validation set of BDD100K for the different settings above. We mainly work under the Disjoint-normal setting for comparisons between the performance of different pre-training methods as well as for ablation studies, and we further verify the efficacy of our proposed methods on the Disjoint-balance and Full settings in Section 5.2. In BDD100K, 70k training images are labeled for both object detection and drivable area segmentation, and only 7k training images are labeled for semantic segmentation.
Hardware Specification | Yes | All experiments are conducted on servers with 8 Nvidia V100 GPU (32GB) cards and Intel Xeon Platinum 8168 CPUs (2.70GHz).
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow) are explicitly stated. MindSpore is mentioned without a version.
Experiment Setup | Yes | The training epoch count is fixed at 36, SyncBN [28] is enabled, the learning rate at the fine-tuning stage is 2.5 × 10⁻⁵, and the weight decay is 1 × 10⁻⁴. The image scale is 1280 × (720, 600). No other data augmentation is used. We adopt the AdamW [27] optimizer. The warmup iteration count is 1000 and the warmup factor is 0.01. For LV-Adapter, the learnable prompts are prepended to the class name, and the length of the prompts is 16. During the adaptation stage, the learning rate is set to 2.5 × 10⁻⁴. The transformer decoder in Equation 5 has 3 layers. (A configuration sketch of these hyperparameters follows the table.)
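
To make the adapt-stage description above more concrete, the following is a minimal PyTorch-style sketch of learnable multi-scale adapters attached to a frozen pretrained backbone, so that multi-task supervision updates only the adapter weights. The module names, bottleneck design, and feature-pyramid interface are illustrative assumptions, not the authors' released code.

    # Sketch only: adapters adjust frozen pretrained features at multiple scales.
    import torch
    import torch.nn as nn

    class ResidualAdapter(nn.Module):
        """Bottleneck adapter applied residually to one feature scale (assumed design)."""

        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            hidden = max(channels // reduction, 1)
            self.down = nn.Conv2d(channels, hidden, kernel_size=1)
            self.act = nn.ReLU(inplace=True)
            self.up = nn.Conv2d(hidden, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.up(self.act(self.down(x)))

    class MultiScaleAdapter(nn.Module):
        """Wraps a frozen backbone and adapts each pyramid level independently."""

        def __init__(self, backbone: nn.Module, channels_per_level: list):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False  # pretrained knowledge stays untouched
            self.adapters = nn.ModuleList([ResidualAdapter(c) for c in channels_per_level])

        def forward(self, images: torch.Tensor) -> list:
            feats = self.backbone(images)  # assumed to return a list of multi-scale feature maps
            return [adapter(f) for adapter, f in zip(self.adapters, feats)]

The adapted features would then feed the task-specific heads (detection, drivable area, and semantic segmentation), with only the adapters and heads receiving gradients during the adapt stage.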
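
The reported hyperparameters can also be summarized as a configuration sketch. The dictionary layout, the interpretation of the image scale as long side 1280 with a short side in [600, 720], and the optimizer/warmup wiring are assumptions for illustration; only the numeric values come from the paper.

    # Sketch only: reported training hyperparameters gathered into one place.
    import torch

    ADAPT_STAGE = dict(lr=2.5e-4)      # learning rate during the adaptation stage
    FINETUNE_STAGE = dict(lr=2.5e-5)   # learning rate during the fine-tuning stage
    COMMON = dict(
        epochs=36,
        weight_decay=1e-4,
        warmup_iters=1000,
        warmup_factor=0.01,
        image_scale=(1280, (720, 600)),  # assumed: long side 1280, short side in [600, 720]
        prompt_length=16,                # learnable prompt tokens prepended to the class name
        decoder_layers=3,                # transformer decoder layers in LV-Adapter (Eq. 5)
    )

    def build_optimizer(model: torch.nn.Module, stage: dict) -> torch.optim.AdamW:
        """AdamW over the trainable parameters with the reported weight decay."""
        params = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.AdamW(params, lr=stage["lr"], weight_decay=COMMON["weight_decay"])

    def warmup_lr_scale(iteration: int) -> float:
        """Linear warmup from warmup_factor to 1.0 over warmup_iters iterations."""
        if iteration >= COMMON["warmup_iters"]:
            return 1.0
        alpha = iteration / COMMON["warmup_iters"]
        return COMMON["warmup_factor"] * (1 - alpha) + alpha

For example, the warmup schedule could be attached with torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lr_scale); the paper itself does not specify the scheduler implementation.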