Linearly Decomposing and Recomposing Vision Transformers for Diverse-Scale Models
Authors: Shuxia Lin, Miaosen Zhang, Ruiming Chen, Xu Yang, Qiufeng Wang, Xin Geng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are used to validate the effectiveness of our method: ViTs can be decomposed and the decomposed learngenes can be recomposed into diverse-scale ViTs, which can achieve comparable or better performance compared to traditional model compression and pre-training methods. The code for our experiments is available in the supplemental material. |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, Southeast University, Nanjing 210096, China 2Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, China {shuxialin, 230228501, 220232251, qfwang, xuyang_palm, xgeng}@seu.edu.cn |
| Pseudocode | No | The paper describes its method through textual descriptions and a pipeline diagram (Figure 3). It does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps. |
| Open Source Code | Yes | The code for our experiments is available in the supplemental material. |
| Open Datasets | Yes | To train each learngene, we use ImageNet-1K [12], which contains approximately 1.2M training images across 1000 classes and 50K validation images. After recomposing ViT models of different layers with learngenes, we adapt them on 9 diverse downstream datasets, which include 3 object classification tasks: CIFAR-10 [26], CIFAR-100 [26], and Tiny-ImageNet [1]; 5 fine-grained classification tasks: iNaturalist-2019 [66], Food-101 [4], Oxford Flowers-102 [35], Stanford Cars [25], and Oxford-IIIT Pets [38]; 1 texture classification task: DTD [10]. |
| Dataset Splits | Yes | To train each learngene, we use ImageNet-1K [12], which contains approximately 1.2M training images across 1000 classes and 50K validation images. |
| Hardware Specification | Yes | The decomposition process is trained over 500 epochs on four NVIDIA RTX 3090 GPUs, and the recomposed models are trained over 100 epochs on two NVIDIA RTX 3090 GPUs for each downstream task. |
| Software Dependencies | No | We implement the model using PyTorch [39] and the Timm library [53]. The paper mentions these software components but does not provide specific version numbers for them. |
| Experiment Setup | Yes | Batch Size: We employ distributed training across 4 GPUs, with each GPU handling 128 data instances, resulting in an overall batch size of 512. Optimizer: The training of each learngene is optimized using AdamW, with an initial learning rate of 0.0008 and a weight decay of 0.05. Learning Rate Schedule: We apply a cosine learning rate decay, with a warm-up period of 5 epochs. (Similar details are provided for the recomposition training settings.) |
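
The experiment setup quoted above can be summarized as a standard PyTorch/timm training configuration. The following is a minimal sketch of how those reported settings (AdamW, lr 0.0008, weight decay 0.05, cosine decay with 5-epoch warm-up, 4 GPUs × 128 instances = 512 effective batch size) might be wired together; the model name, warm-up start learning rate, and per-epoch scheduler stepping are illustrative assumptions and are not taken from the paper's released code.

```python
# Sketch of the reported decomposition-stage optimizer and schedule.
# Assumptions (not from the paper): the ViT variant, warmup_lr_init, and
# stepping the scheduler once per epoch.
import torch
import timm
from timm.scheduler import CosineLRScheduler

model = timm.create_model("deit_tiny_patch16_224", pretrained=False)  # placeholder ViT

# AdamW with lr = 0.0008 and weight decay = 0.05, as reported.
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4, weight_decay=0.05)

# Cosine learning-rate decay over 500 epochs with a 5-epoch warm-up.
epochs, warmup_epochs = 500, 5
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=epochs,
    warmup_t=warmup_epochs,
    warmup_lr_init=1e-6,  # warm-up start value: an assumption, not reported
)

# Effective batch size: 4 GPUs * 128 instances per GPU = 512.
per_gpu_batch_size, num_gpus = 128, 4
global_batch_size = per_gpu_batch_size * num_gpus  # 512

for epoch in range(epochs):
    scheduler.step(epoch)  # timm schedulers are updated per epoch
    # ... one training epoch over ImageNet-1K would run here ...
```

In practice the 4-GPU setting would be realized with `torch.nn.parallel.DistributedDataParallel` and a `DistributedSampler`, which is omitted here to keep the sketch focused on the reported hyperparameters.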