Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LOIRE: LifelOng learning on Incremental data via pre-trained language model gRowth Efficiently
Authors: Xue Han, Yitong Wang, Junlan Feng, Wenchun Gao, Qian Hu, Chao Deng
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 EXPERIMENTS 3.1 EXPERIMENTAL SETUP 3.2 RESULTS AND ANALYSIS We design a set of experiments to validate LOIRE. To begin, we evaluate LOIRE's performance on the pre-training and downstream tasks. We also evaluate the function-preserving effect using the pre-training findings. Next, we compare FLOPS and wall time costs to determine training efficiency. Finally, we conduct ablation studies to investigate the separate impact of growth operators, schedules, and distillation. |
| Researcher Affiliation | Industry | Xue Han , Yitong Wang , Junlan Feng , Wenchun Gao, Qian Hu & Chao Deng JIUTIAN Team China Mobile Research Institute Beijing, China EMAIL |
| Pseudocode | Yes | C ALGORITHM Algorithm 1 summarizes LOIRE for growing Transformer in lifelong learning. |
| Open Source Code | No | No explicit statement or link is provided for the authors' own methodology (LOIRE). The only code reference is for a baseline method: "LiGO (Wang et al., 2023) is an efficient data-driven method... We reproduce this work leveraging the LiGO open-source code to first train from scratch using the Redpajama dataset." with a footnote pointing to "https://vita-group.github.io/LiGO/" |
| Open Datasets | Yes | 3.1 EXPERIMENTAL SETUP Pre-training Dataset. We use five different domain datasets for growth pre-training, including the combination of Wikipedia & BookCorpus (WB) (Zhu et al., 2015), RealNews (NEWS) (Zellers et al., 2019), Amazon Reviews (REV) (He & McAuley, 2016), Biomedical papers (BIO) (Lo et al., 2019), and Computer science papers (CS) (Lo et al., 2019), publicly available on the Hugging Face Hub. In each domain, we sample out 10 GB of data and divide it into pre-training and recovery memory data for distillation in a 9:1 ratio. Fig. 2 shows the correlations among these five datasets. We also use RedPajama (Weber et al., 2025) to generate initial version models for further evaluation, and then use the five domain datasets to proceed with training during lifelong learning. |
| Dataset Splits | Yes | In each domain, we sample out 10 GB of data and divide it into pre-training and recovery memory data for distillation in a 9:1 ratio. |
| Hardware Specification | Yes | Our setup consists of a four-core CPU and eight NVIDIA Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions "Adam is chosen as the optimizer." but does not provide specific version numbers for any software components. |
| Experiment Setup | Yes | For hyper-parameters, we linearly increase the λ value in the layer growth operator from 0 to 1 in 5,000 steps per growth stage. During the iterative distillation in the warmup period, the best β is set to 0.1 and vanishes after 1,000 steps. Adam is chosen as the optimizer. Our setup consists of a four-core CPU and eight NVIDIA Tesla A100 GPUs. We list the hyper-parameters used in the GPT architecture's domain downstream experiments in Table 8. The GLUE benchmarks for downstream tasks, such as MNLI and QNLI, are based on Ott et al. (2019). SQuAD benchmarks are based on Rajpurkar et al. (2016). More downstream tasks are implemented, as detailed in Gururangan et al. (2020). |
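The Experiment Setup row above describes two training schedules: the layer-growth weight λ ramps linearly from 0 to 1 over 5,000 steps per growth stage, and the distillation weight β starts at 0.1 and vanishes after 1,000 warmup steps. Since the authors' code is not released, the following is only a minimal sketch of one plausible reading of those schedules; the function names are hypothetical, and the exact decay shape of β (here a hard cutoff) is not stated in the quoted text.

```python
def lambda_schedule(step: int, ramp_steps: int = 5000) -> float:
    """Layer-growth interpolation weight: linear ramp from 0 to 1,
    restarting from 0 at each growth stage (per the quoted setup)."""
    return min(step / ramp_steps, 1.0)


def beta_schedule(step: int, beta0: float = 0.1, vanish_steps: int = 1000) -> float:
    """Distillation loss weight: beta0 = 0.1 during the warmup period,
    zero after 1,000 steps (one reading of 'vanishes after 1,000 steps')."""
    return beta0 if step < vanish_steps else 0.0
```

For example, halfway through a growth stage `lambda_schedule(2500)` yields 0.5, while `beta_schedule(1500)` yields 0.0 since warmup has ended.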