CLEX: Continuous Length Extrapolation for Large Language Models

Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of CLEX on two datasets: (1) a subset of RedPajama-Book (Computer, 2023) for long-context language modelling, and (2) LongBench (Bai et al., 2023) for long-context practical tasks. Empirically, CLEX demonstrates remarkable length extrapolation ability in language modelling, extending the context window to more than 4× the training length without any performance deterioration.
Researcher Affiliation | Collaboration | Sun Yat-sen University; DAMO Academy, Alibaba Group; Hupan Lab, 310023, Hangzhou, China; University of Glasgow; Mohamed bin Zayed University of Artificial Intelligence
Pseudocode | Yes | The training procedure of CLEX is shown in Alg. 1 (Algorithm 1: Training Procedure of CLEX); a hedged sketch of this loop is given after the table.
Open Source Code | Yes | Our code is available at https://github.com/DAMO-NLP-SG/CLEX.
Open Datasets | Yes | We evaluate the performance of CLEX on two datasets: (1) a subset of RedPajama-Book (Computer, 2023) for long-context language modelling, and (2) LongBench (Bai et al., 2023) for long-context practical tasks.
Dataset Splits | No | The paper states that models are trained on 2B tokens and evaluated on 20 million tokens, but it does not specify explicit training, validation, and test dataset splits (e.g., percentages or counts for a validation set) needed for full reproducibility.
Hardware Specification | Yes | Given a context length of 16k in LongBench with a generation length of 512, the generation throughput of our CLEX-7B and LLaMA-2-7B is comparable (27.8 tokens/s vs 28.3 tokens/s, on a single A100).
Software Dependencies | No | The paper mentions software like 'FlashAttention' and the 'Adam optimiser' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We set the learning rate to 2e-5 for all models, which are optimised with the Adam optimiser (Kingma & Ba, 2015). The batch size is set to 64k tokens for 7B models and 16k tokens for 13B models. The maximum desired t during training in CLEX (namely t^Train in §3.3) is set as 16 for LLaMA-2. The amplification factor of the ODE layer, λ, is set as 1 for all 7B models and 2 for 13B models. For instruction tuning, we train our model on the UltraChat dataset for 1 epoch, starting from the checkpoint obtained after the language-modelling training. A hedged configuration sketch collecting these values follows the table.
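
Since Algorithm 1 is only referenced above, here is a minimal sketch of what such a training loop could look like, assuming a per-step scaling factor t sampled continuously from [1, t^Train] and a small network that drives the continuous scaling of the RoPE inverse frequencies. All names (FreqScalingODE, scaled_inv_freq, train_step, the rope_inv_freq argument) are hypothetical, the Euler integration stands in for the paper's neural-ODE formulation, and the handling of position indices is omitted; the authors' reference implementation is at https://github.com/DAMO-NLP-SG/CLEX.

```python
# Hypothetical sketch of a CLEX-style training step (cf. Alg. 1 in the paper).
# Module/function names are placeholders, not the authors' API.
import torch
import torch.nn as nn


class FreqScalingODE(nn.Module):
    """Tiny network g_theta driving the continuous scaling of RoPE inverse
    frequencies; lambda_amp mirrors the ODE-layer amplification factor
    quoted in the experiment setup (1 for 7B models, 2 for 13B models)."""

    def __init__(self, head_dim: int, lambda_amp: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(head_dim // 2, 64), nn.SiLU(), nn.Linear(64, head_dim // 2)
        )
        self.lambda_amp = lambda_amp

    def forward(self, log_scale: torch.Tensor) -> torch.Tensor:
        return self.lambda_amp * self.net(log_scale)


def scaled_inv_freq(ode: FreqScalingODE, base_inv_freq: torch.Tensor,
                    t: float, n_steps: int = 8) -> torch.Tensor:
    """Integrate d(log alpha)/ds = g_theta(log alpha) from s = 1 to s = t with
    a plain Euler solver (illustrative stand-in for the neural-ODE solver)."""
    log_scale = torch.zeros_like(base_inv_freq)        # alpha(1) = 1, i.e. no scaling
    ds = (t - 1.0) / n_steps
    for _ in range(n_steps):
        log_scale = log_scale + ds * ode(log_scale)
    return base_inv_freq * torch.exp(log_scale)


def train_step(model, ode, batch, base_inv_freq, t_train: float = 16.0):
    """One CLEX-style update: sample a continuous scaling factor t in
    [1, t_train], scale the RoPE frequencies accordingly, and optimise the
    usual language-modelling loss. `model(batch, rope_inv_freq=...)` is a
    hypothetical interface."""
    t = 1.0 + (t_train - 1.0) * torch.rand(()).item()  # continuous sampling
    inv_freq = scaled_inv_freq(ode, base_inv_freq, t)
    loss = model(batch, rope_inv_freq=inv_freq)
    loss.backward()
    return loss


# Example: standard RoPE inverse frequencies for head_dim = 128.
head_dim = 128
base_inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
ode = FreqScalingODE(head_dim, lambda_amp=1.0)
inv_freq_8x = scaled_inv_freq(ode, base_inv_freq, t=8.0)  # frequencies for an 8x context
```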
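
For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. This is purely illustrative: only the values come from the paper, while the field names and the dataclass itself are assumptions, not the released code's configuration format.

```python
# Illustrative training configurations mirroring the quoted hyperparameters.
from dataclasses import dataclass


@dataclass
class CLEXTrainConfig:
    learning_rate: float = 2e-5        # Adam optimiser, all models
    batch_size_tokens: int = 64_000    # 64k tokens for 7B, 16k for 13B
    t_train: float = 16.0              # max desired scaling factor t^Train (LLaMA-2)
    lambda_amp: float = 1.0            # ODE-layer amplification: 1 (7B), 2 (13B)
    instruction_dataset: str = "UltraChat"
    instruction_epochs: int = 1        # starts from the language-modelling checkpoint


CONFIG_7B = CLEXTrainConfig()
CONFIG_13B = CLEXTrainConfig(batch_size_tokens=16_000, lambda_amp=2.0)
```

CONFIG_13B overrides only the batch size and amplification factor, matching the 13B settings quoted above.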