Reusing Pretrained Models by Multi-linear Operators for Efficient Training

Authors: Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng Shang, Xin Jiang, Qun Liu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our method can save 76% computational costs on DeiT-base transferred from DeiT-small, which outperforms bert2BERT by +12.0% and LiGO by +20.7%, respectively. In this section, we design a set of experiments to validate the proposed Mango.
Researcher Affiliation | Collaboration | Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong, China; Pengcheng Laboratory, Shenzhen, China; Peking University, Beijing, China; Huawei Noah's Ark Lab, Shenzhen, Guangdong, China
Pseudocode | No | The paper describes 'Procedures of Applying Mango' in Section 3.2 as numbered steps in prose, but it does not provide a formally structured pseudocode or algorithm block.
Open Source Code | No | The paper does not provide any statement or link indicating that its source code is publicly available.
Open Datasets | Yes | We use three tiny vision Transformers (ViTs) [11], i.e., DeiT-T-A, DeiT-T-B, and DeiT-T-C, for growing to DeiT-S [46] on ImageNet [9]... The dataset is the concatenation of English Wikipedia and Toronto Book Corpus [71]... We show the effectiveness of Mango on SQuAD and GLUE benchmark as in Table 3. To investigate the influence of Mango on transferring ability, we also conduct an experiment on downstream tasks, including CIFAR10 [26], CIFAR100 [26], Flowers [31], Cars [25], and ChestXRay8 [54].
Dataset Splits | No | The paper does not explicitly specify dataset split percentages (e.g., 80/10/10) or the methodology used to form training, validation, and test sets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments; it only mentions training the models in general terms.
Software Dependencies | No | The paper mentions optimizers such as 'Adam' and 'AdamW' but does not specify version numbers for any software frameworks, libraries, or tools used (e.g., PyTorch 1.9).
Experiment Setup | Yes | We train Mango operators for 100 steps... We use Adam with learning rate 1e-3 and weight decay 1e-2 for 300-epoch optimization. The batch size is 1024. The training epoch is 40. The batch size is 768. ... The optimizer is set to AdamW. The learning rate is 1e-4 and the weight decay is 1e-2. The training epoch is 35. The batch size is 512.
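
The optimizer settings quoted in the Experiment Setup row are concrete enough to restate as code. The following is a minimal sketch, assuming PyTorch (the paper does not name its framework or any version); the model is a placeholder, and the pairing of each optimizer with a particular experiment follows the order of the quoted settings rather than any statement in the paper.

import torch

# Placeholder model; the paper grows small pretrained Transformers
# (e.g., DeiT-small) into larger targets such as DeiT-base.
model = torch.nn.Linear(384, 768)

# Quoted setting: Adam, learning rate 1e-3, weight decay 1e-2,
# 300 epochs, batch size 1024.
adam_opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Quoted setting: AdamW, learning rate 1e-4, weight decay 1e-2,
# 35 epochs, batch size 512.
adamw_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)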