Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing
Authors: Zhili Liu, Jianhua Han, Lanqing Hong, Hang Xu, Kai Chen, Chunjing Xu, Zhenguo Li
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that our SDR can train 256 sub-nets on ImageNet simultaneously, which provides better transfer performance than a unified model trained on the full ImageNet, achieving state-of-the-art (SOTA) averaged accuracy over 11 downstream classification tasks and AP on the PASCAL VOC detection task. In this section, we apply the proposed SDR to train SDRnet and a series of sub-nets. We demonstrate the effectiveness of SDR by evaluating the resulting pre-trained models on various downstream tasks including classification and detection. We also conduct ablation studies on the number of sub-nets, training time and the distillation method as shown in Sec. 5.3. |
| Researcher Affiliation | Collaboration | Zhili Liu1,2, Jianhua Han2, Lanqing Hong2, Hang Xu2, Kai Chen1, Chunjing Xu2, Zhenguo Li2; 1 Department of Computer Science and Engineering, Hong Kong University of Science and Technology; 2 Huawei Noah's Ark Lab; {zhili.liu, kai.chen}@connect.ust.hk, {hanjianhua4, honglanqing, xu.hang, xuchunjing, li.zhenguo}@huawei.com |
| Pseudocode | Yes | Refer to Algorithm 1 in Appendix D for the entire training procedure. |
| Open Source Code | No | No explicit statement or link for open-source code for the methodology is provided in the paper. |
| Open Datasets | Yes | We deliberately split ImageNet into two disjoint subsets, namely Subset-A and Subset-B, based on their semantic dissimilarity in the WordNet tree (Miller 1998). We adopt ImageNet as the dataset for self-supervised pretraining without using labels. |
| Dataset Splits | Yes | The quality of the pre-trained representations is evaluated by training a supervised linear classifier upon the frozen representations in the training set, and then testing it in the validation set. |
| Hardware Specification | Yes | On Food101 (Bossard, Guillaumin, and Van Gool 2014), for example, it takes 20 minutes with 8*V100 to decide the best route. [...] which takes about 15 minutes on 8*V100 for each model. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | The training epochs for the three models are the same. Model configuration. We apply the SDR block in all four stages of ResNet. In each stage, all blocks have four individual groups and one shared group. The size of shared groups is half of all groups. All blocks in the same stage perform identically, so we can generate 4^4 = 256 different sub-nets. For comparison, we enlarge our model so that the size of each sub-net is close to that of ResNet-50 (He et al. 2016), the most commonly used backbone in SSL. For deployment, we reset each sub-net with the corresponding batch normalization (BN) statistics in pre-training following (Cai et al. 2019). We adopt ImageNet as the dataset for self-supervised pretraining without using labels. We use SimSiam (Chen and He 2021) and BYOL (Grill et al. 2020) as our baseline models. See more experimental details and hyper-parameters in Appendix A. |
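The Experiment Setup row describes each SDR block as one always-active shared group plus four individual groups, with every block in a stage making the same routing choice, so four stages yield 4^4 = 256 sub-nets. Below is a minimal, hypothetical sketch of that routing and counting logic; `SDRBlockSketch`, the channel sizes, and the routing interface are illustrative assumptions, not the authors' released implementation.

```python
import itertools
import torch
import torch.nn as nn

class SDRBlockSketch(nn.Module):
    """Toy dynamic-routing block: a shared group that is always active
    plus several individual groups, one of which is selected per route.
    Layer shapes and names are illustrative only."""

    def __init__(self, channels: int = 64, num_individual: int = 4):
        super().__init__()
        shared_out = channels // 2  # shared group sized at half of all groups
        self.shared = nn.Conv2d(channels, shared_out, 3, padding=1)
        self.individual = nn.ModuleList(
            nn.Conv2d(channels, channels - shared_out, 3, padding=1)
            for _ in range(num_individual)
        )

    def forward(self, x: torch.Tensor, route: int) -> torch.Tensor:
        # Concatenate the shared path with the routed individual path.
        return torch.cat([self.shared(x), self.individual[route](x)], dim=1)

# All blocks within a stage share one routing choice, so with 4 stages and
# 4 individual groups per block the number of distinct sub-nets is 4**4 = 256.
routes = list(itertools.product(range(4), repeat=4))
assert len(routes) == 256

block = SDRBlockSketch()
features = block(torch.randn(1, 64, 32, 32), route=2)  # pick individual group 2
print(features.shape)  # torch.Size([1, 64, 32, 32])
```

The paper additionally resets each deployed sub-net with its own batch-normalization statistics collected during pre-training (following Cai et al. 2019); that bookkeeping is omitted from this sketch.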
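The Dataset Splits row quotes the standard linear-evaluation protocol: a supervised linear classifier is trained on frozen representations over the training set and tested on the validation set. The sketch below illustrates that protocol only; the ResNet-50 backbone, 2048-d feature size, and optimizer settings are assumptions for illustration, not the paper's reported hyper-parameters.

```python
import torch
import torch.nn as nn
import torchvision

# Frozen backbone: weights would come from the self-supervised checkpoint
# (checkpoint loading omitted); ResNet-50 with 2048-d features is an
# assumption matching common SSL practice.
backbone = torchvision.models.resnet50()
backbone.fc = nn.Identity()
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

classifier = nn.Linear(2048, 1000)  # supervised linear classifier on frozen features
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def linear_probe_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One training step of the linear classifier on frozen representations."""
    with torch.no_grad():
        feats = backbone(images)
    loss = criterion(classifier(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```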