CLEX: Continuous Length Extrapolation for Large Language Models

Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of CLEX on two datasets: (1) a subset of RedPajama-Book (Computer, 2023) for long-context language modelling, and (2) LongBench (Bai et al., 2023) for long-context practical tasks. Empirically, CLEX demonstrates remarkable length extrapolation ability in language modelling, extending the context window to more than 4× the training length without any performance deterioration.
Researcher Affiliation | Collaboration | Sun Yat-sen University; DAMO Academy, Alibaba Group; Hupan Lab, 310023, Hangzhou, China; University of Glasgow; Mohamed bin Zayed University of Artificial Intelligence
Pseudocode | Yes | The training procedure of CLEX is shown in Alg. 1 (Algorithm 1: Training Procedure of CLEX); a hedged sketch of this loop is given after the table.
Open Source Code | Yes | Our code is available at https://github.com/DAMO-NLP-SG/CLEX.
Open Datasets | Yes | We evaluate the performance of CLEX on two datasets: (1) a subset of RedPajama-Book (Computer, 2023) for long-context language modelling, and (2) LongBench (Bai et al., 2023) for long-context practical tasks.
Dataset Splits | No | The paper states that models are trained on 2B tokens and evaluated on 20 million tokens, but it does not specify explicit training, validation, and test dataset splits (e.g., percentages or counts for a validation set) needed for full reproducibility.
Hardware Specification | Yes | Given a context length of 16k in LongBench with a generation length of 512, the generation throughput of our CLEX-7B and LLaMA-2-7B is comparable (27.8 tokens/s vs 28.3 tokens/s, on a single A100).
Software Dependencies | No | The paper mentions software like 'FlashAttention' and the 'Adam optimiser' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We set the learning rate to 2e-5 for all models, which are optimised with the Adam optimiser (Kingma & Ba, 2015). The batch size is set to 64k tokens for 7B models and 16k tokens for 13B models. The maximum desired t during training in CLEX (namely t^Train in §3.3) is set as 16 for LLaMA-2. The amplification factor of the ODE layer, λ, is set as 1 for all 7B models and 2 for 13B models. For instruction tuning, we train our model on the UltraChat dataset for 1 epoch, starting from the checkpoint obtained after the language-modelling training. A hedged configuration sketch collecting these values follows the table.
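
Since Algorithm 1 is only referenced above, here is a minimal sketch of what such a training loop could look like, assuming a per-step scaling factor t sampled continuously from [1, t^Train] and a small network that drives the continuous scaling of the RoPE inverse frequencies. All names (FreqScalingODE, scaled_inv_freq, train_step, the rope_inv_freq argument) are hypothetical, the Euler integration stands in for the paper's neural-ODE formulation, and the handling of position indices is omitted; the authors' reference implementation is at https://github.com/DAMO-NLP-SG/CLEX.

```python
# Hypothetical sketch of a CLEX-style training step (cf. Alg. 1 in the paper).
# Module/function names are placeholders, not the authors' API.
import torch
import torch.nn as nn


class FreqScalingODE(nn.Module):
    """Tiny network g_theta driving the continuous scaling of RoPE inverse
    frequencies; lambda_amp mirrors the ODE-layer amplification factor
    quoted in the experiment setup (1 for 7B models, 2 for 13B models)."""

    def __init__(self, head_dim: int, lambda_amp: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(head_dim // 2, 64), nn.SiLU(), nn.Linear(64, head_dim // 2)
        )
        self.lambda_amp = lambda_amp

    def forward(self, log_scale: torch.Tensor) -> torch.Tensor:
        return self.lambda_amp * self.net(log_scale)


def scaled_inv_freq(ode: FreqScalingODE, base_inv_freq: torch.Tensor,
                    t: float, n_steps: int = 8) -> torch.Tensor:
    """Integrate d(log alpha)/ds = g_theta(log alpha) from s = 1 to s = t with
    a plain Euler solver (illustrative stand-in for the neural-ODE solver)."""
    log_scale = torch.zeros_like(base_inv_freq)        # alpha(1) = 1, i.e. no scaling
    ds = (t - 1.0) / n_steps
    for _ in range(n_steps):
        log_scale = log_scale + ds * ode(log_scale)
    return base_inv_freq * torch.exp(log_scale)


def train_step(model, ode, batch, base_inv_freq, t_train: float = 16.0):
    """One CLEX-style update: sample a continuous scaling factor t in
    [1, t_train], scale the RoPE frequencies accordingly, and optimise the
    usual language-modelling loss. `model(batch, rope_inv_freq=...)` is a
    hypothetical interface."""
    t = 1.0 + (t_train - 1.0) * torch.rand(()).item()  # continuous sampling
    inv_freq = scaled_inv_freq(ode, base_inv_freq, t)
    loss = model(batch, rope_inv_freq=inv_freq)
    loss.backward()
    return loss


# Example: standard RoPE inverse frequencies for head_dim = 128.
head_dim = 128
base_inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
ode = FreqScalingODE(head_dim, lambda_amp=1.0)
inv_freq_8x = scaled_inv_freq(ode, base_inv_freq, t=8.0)  # frequencies for an 8x context
```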
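
For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. This is purely illustrative: only the values come from the paper, while the field names and the dataclass itself are assumptions, not the released code's configuration format.

```python
# Illustrative training configurations mirroring the quoted hyperparameters.
from dataclasses import dataclass


@dataclass
class CLEXTrainConfig:
    learning_rate: float = 2e-5        # Adam optimiser, all models
    batch_size_tokens: int = 64_000    # 64k tokens for 7B, 16k for 13B
    t_train: float = 16.0              # max desired scaling factor t^Train (LLaMA-2)
    lambda_amp: float = 1.0            # ODE-layer amplification: 1 (7B), 2 (13B)
    instruction_dataset: str = "UltraChat"
    instruction_epochs: int = 1        # starts from the language-modelling checkpoint


CONFIG_7B = CLEXTrainConfig()
CONFIG_13B = CLEXTrainConfig(batch_size_tokens=16_000, lambda_amp=2.0)
```

CONFIG_13B overrides only the batch size and amplification factor, matching the 13B settings quoted above.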