Transformer as Linear Expansion of Learngene
Authors: Shiyu Xia, Miaosen Zhang, Xu Yang, Ruiming Chen, Haokun Chen, Xin Geng
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance in contrast to many individual models trained from scratch, while reducing around 2× training cost. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). |
| Researcher Affiliation | Academia | Shiyu Xia, Miaosen Zhang, Xu Yang*, Ruiming Chen, Haokun Chen, Xin Geng* School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China {shiyuxia, 230228501, 101013120, 213193308, chenhaokun, xgeng}@seu.edu.cn |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Source code is available at https://github.com/AlphaXia/TLEG. |
| Open Datasets | Yes | We conduct experiments on ImageNet-1K (Deng et al. 2009) and several middle/small-scale datasets including iNaturalist 2019 (iNat19) (Zhou et al. 2020), Mini-ImageNet (Mi-INet) (Vinyals et al. 2016), Tiny-ImageNet (Ti-INet) (Le and Yang 2015), CIFAR-10 (C-10), CIFAR-100 (C-100) (Krizhevsky, Hinton et al. 2009) and Food-101 (F-101) (Bossard, Guillaumin, and Van Gool 2014). |
| Dataset Splits | No | The paper mentions using standard datasets like ImageNet-1K, CIFAR-10, CIFAR-100, etc., but does not explicitly provide the train/validation/test split percentages or sample counts within the text. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers. |
| Experiment Setup | Yes | For Aux-S and Des-S of 10 different depths, we train Aux-S for 150 epochs and each Des-S for 35 epochs, except that we train 11-layer Des-S for 45 epochs. ... For Aux-B and Des-B of 10 different depths, we train Aux-B for 100 epochs and each Des-B for 40 epochs. ... For Aux-Ti and Des-Ti of 4 different depths, we train Aux-Ti for 150 epochs and each Des-Ti for 50 epochs. ... we introduce one distillation loss: L_D = KL(ϕ(z_s/τ), ϕ(z_t/τ)), ... our total training loss is defined as: L = (1 − λ) CE(ϕ(z_s), y) + λ L_D |
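
For reference, below is a minimal PyTorch-style sketch of the training loss quoted in the Experiment Setup row. It is not the authors' released code: it implements the formula exactly as quoted (no τ² rescaling of the distillation term), and the names `tleg_style_loss`, `tau`, and `lam` are illustrative stand-ins for τ and λ.

```python
import torch
import torch.nn.functional as F

def tleg_style_loss(z_s, z_t, y, tau=1.0, lam=0.5):
    """Hedged sketch of L = (1 - lambda) * CE(phi(z_s), y) + lambda * L_D,
    where L_D = KL(phi(z_s / tau), phi(z_t / tau)) and phi is softmax.

    z_s: student (descendant) logits, shape (batch, num_classes)
    z_t: teacher (auxiliary) logits, shape (batch, num_classes)
    y:   integer class labels, shape (batch,)
    """
    # Distillation term: KL divergence between temperature-scaled softmaxes.
    # F.kl_div expects log-probabilities for the input and probabilities for the target.
    log_p_s = F.log_softmax(z_s / tau, dim=-1)
    p_t = F.softmax(z_t / tau, dim=-1)
    loss_d = F.kl_div(log_p_s, p_t, reduction="batchmean")

    # Hard-label cross-entropy on the student logits.
    loss_ce = F.cross_entropy(z_s, y)

    # Weighted combination as in the quoted total loss.
    return (1.0 - lam) * loss_ce + lam * loss_d

# Example usage with random tensors (shapes only; values are meaningless):
if __name__ == "__main__":
    z_s = torch.randn(8, 1000)
    z_t = torch.randn(8, 1000)
    y = torch.randint(0, 1000, (8,))
    print(tleg_style_loss(z_s, z_t, y, tau=2.0, lam=0.5))
```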