CLEX: Continuous Length Extrapolation for Large Language Models
Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of CLEX on two datasets: (1) a subset of RedPajama-Book (Computer, 2023) for long-context language modelling, and (2) LongBench (Bai et al., 2023) for long-context practical tasks. Empirically, CLEX demonstrates remarkable length extrapolation ability in language modelling, which can extend the context window to more than 4× the training length without any performance deterioration. |
| Researcher Affiliation | Collaboration | 1 Sun Yat-sen University; 2 DAMO Academy, Alibaba Group; 3 Hupan Lab, 310023, Hangzhou, China; 4 University of Glasgow; 5 Mohamed bin Zayed University of Artificial Intelligence |
| Pseudocode | Yes | The training procedure of CLEX is shown in Alg. 1. (Algorithm 1: Training Procedure of CLEX) |
| Open Source Code | Yes | Our code is available at https://github.com/DAMO-NLP-SG/CLEX. |
| Open Datasets | Yes | We evaluate the performance of CLEX on two datasets: (1) a subset of RedPajama-Book (Computer, 2023) for long-context language modelling, and (2) LongBench (Bai et al., 2023) for long-context practical tasks. |
| Dataset Splits | No | The paper states that models are trained on 2B tokens and evaluated on 20 million tokens, but it does not specify explicit training, validation, and test dataset splits (e.g., percentages or counts for a validation set) needed for full reproducibility. |
| Hardware Specification | Yes | Given a context length of 16k in LongBench with a generation length of 512, the generation throughput between our CLEX-7B and LLaMA-2-7B is comparable (27.8 tokens/s vs 28.3 tokens/s, on a single A100). |
| Software Dependencies | No | The paper mentions software like 'Flash Attention' and 'Adam optimiser' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set the learning rate of 2e-5 for all models, which are optimised by the Adam optimiser (Kingma & Ba, 2015). The batch size is set to 64k tokens for 7B models and 16k tokens for 13B models. The maximum desired t during training in CLEX (namely t^Train in Sec. 3.3) is set as 16 for LLaMA-2. The amplification factor of the ODE layer λ is set as 1 for all 7B models and 2 for 13B models. For the instruction tuning, we train our model using the UltraChat dataset for 1 epoch, starting with the checkpoint after the training of language modelling. |
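To make the Experiment Setup row above easier to act on, here is a minimal sketch that collects the reported hyperparameters into one configuration object. The class name, field names, and the `make_config` helper are illustrative assumptions, not part of the released CLEX code; only the numeric values are taken from the quoted text, and unreported details (optimiser betas, learning-rate schedule, warm-up) are deliberately left out.

```python
# Hedged configuration sketch for reproducing the reported CLEX training setup.
# Only the quoted values (learning rate, Adam, token batch sizes, t^Train, lambda,
# UltraChat for 1 epoch) are grounded; all names below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class CLEXTrainingConfig:
    model_size: str                           # "7B" or "13B" (LLaMA-2 backbone)
    learning_rate: float = 2e-5               # reported for all models
    optimizer: str = "adam"                   # Adam (Kingma & Ba, 2015)
    batch_size_tokens: int = 64_000           # "64k" tokens for 7B, "16k" for 13B (as quoted)
    t_train_max: int = 16                     # maximum desired scaling factor t during training
    ode_amplification_lambda: float = 1.0     # 1 for 7B models, 2 for 13B models
    instruction_tuning_dataset: str = "UltraChat"  # 1 epoch, after language-modelling training
    instruction_tuning_epochs: int = 1


def make_config(model_size: str) -> CLEXTrainingConfig:
    """Return the reported settings for the two model scales (assumed helper)."""
    if model_size == "7B":
        return CLEXTrainingConfig(model_size="7B",
                                  batch_size_tokens=64_000,
                                  ode_amplification_lambda=1.0)
    if model_size == "13B":
        return CLEXTrainingConfig(model_size="13B",
                                  batch_size_tokens=16_000,
                                  ode_amplification_lambda=2.0)
    raise ValueError(f"Unknown model size: {model_size}")


if __name__ == "__main__":
    for size in ("7B", "13B"):
        print(make_config(size))
```

This is only a bookkeeping aid for the values the paper states; anyone reproducing the runs should take the remaining training details from the released repository at https://github.com/DAMO-NLP-SG/CLEX.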