Transformer as Linear Expansion of Learngene

Authors: Shiyu Xia, Miaosen Zhang, Xu Yang, Ruiming Chen, Haokun Chen, Xin Geng

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance in contrast to many individual models trained from scratch, while reducing around 2× training cost. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100).
Researcher Affiliation | Academia | Shiyu Xia, Miaosen Zhang, Xu Yang*, Ruiming Chen, Haokun Chen, Xin Geng*; School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; {shiyu_xia, 230228501, 101013120, 213193308, chenhaokun, xgeng}@seu.edu.cn
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Source code is available at https://github.com/AlphaXia/TLEG.
Open Datasets | Yes | We conduct experiments on ImageNet-1K (Deng et al. 2009) and several middle/small-scale datasets including iNaturalist 2019 (iNat 19) (Zhou et al. 2020), Mini-ImageNet (Mi-INet) (Vinyals et al. 2016), Tiny-ImageNet (Ti-INet) (Le and Yang 2015), CIFAR-10 (C-10), CIFAR-100 (C-100) (Krizhevsky, Hinton et al. 2009) and Food-101 (F-101) (Bossard, Guillaumin, and Van Gool 2014).
Dataset Splits | No | The paper mentions using standard datasets like ImageNet-1K, CIFAR-10, CIFAR-100, etc., but does not explicitly provide the train/validation/test split percentages or sample counts within the text.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers.
Experiment Setup | Yes | For Aux-S and Des-S of 10 different depths, we train Aux-S for 150 epochs and each Des-S for 35 epochs, except that we train 11-layer Des-S for 45 epochs. ... For Aux-B and Des-B of 10 different depths, we train Aux-B for 100 epochs and each Des-B for 40 epochs. ... For Aux-Ti and Des-Ti of 4 different depths, we train Aux-Ti for 150 epochs and each Des-Ti for 50 epochs. ... we introduce one distillation loss: L_D = KL(ϕ(z_s/τ), ϕ(z_t/τ)), ... our total training loss is defined as: L = (1 − λ) CE(ϕ(z_s), y) + λ L_D
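
To make the quoted objective concrete, below is a minimal PyTorch sketch of the total training loss as written above, L = (1 − λ) CE(ϕ(z_s), y) + λ KL(ϕ(z_s/τ), ϕ(z_t/τ)). It is not the authors' released implementation; the function name tleg_training_loss and the arguments student_logits, teacher_logits, tau and lam are illustrative. The quoted formula does not show the τ² scaling that some soft-distillation implementations apply to the KL term, so it is omitted here as well.

```python
import torch
import torch.nn.functional as F


def tleg_training_loss(student_logits, teacher_logits, targets, tau=1.0, lam=0.5):
    """Sketch of L = (1 - lam) * CE(phi(z_s), y) + lam * KL(phi(z_s/tau), phi(z_t/tau))."""
    # Hard-label cross-entropy on the student predictions.
    ce = F.cross_entropy(student_logits, targets)
    # KL term between temperature-softened distributions, following the common
    # soft-distillation convention: F.kl_div takes student log-probabilities as
    # input and teacher probabilities as target.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    )
    return (1.0 - lam) * ce + lam * kd


# Example usage with hypothetical logits over 1000 ImageNet-1K classes.
student_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
loss = tleg_training_loss(student_logits, teacher_logits, targets, tau=1.0, lam=0.5)
```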