Training-Free Long-Context Scaling of Large Language Models

Authors: Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a comprehensive evaluation of our models on a diverse range of tasks that include language modeling, passkey retrieval, and real-world long-context applications that span question answering (Pang et al., 2022; Kočiský et al., 2018; Dasigi et al., 2021; An et al., 2023) and summarization (Zhong et al., 2021). ... 4. Experiments
Researcher Affiliation | Collaboration | Chenxin An *,1,2, Fei Huang 2, Jun Zhang, Shansan Gong 1, Xipeng Qiu 3, Chang Zhou 2, Lingpeng Kong 1 ... *Work done during internship at Alibaba Group. 1 The University of Hong Kong, 2 Alibaba Group, 3 Fudan University.
Pseudocode | Yes | The PyTorch-style pseudocode for integrating DCA with Flash Attention 2 (Dao, 2023) can be found in Algorithm 1. The explanation and complexity analysis of the code can be found in Appendix A.3. (A hedged sketch of the chunked-attention pattern follows the table.)
Open Source Code | Yes | All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
Open Datasets | Yes | We evaluate the long-sequence language modeling performance of our CHUNKLLAMA2 on the book corpus dataset PG19 (Rae et al., 2020)... The training dataset is sourced from ShareGPT and AlpacaGPT4 (Taori et al., 2023).
Dataset Splits | No | The paper mentions evaluating on the 'PG19 validation set' and discusses some training data for finetuning (e.g., '5,405 training instances'), but it does not provide explicit train/validation/test split percentages or sample counts needed to reproduce all experiments.
Hardware Specification | Yes | For the 7B/13B variants of CHUNKLLAMA2, we only need a single NVIDIA A100-80G GPU for inference. When scaling up to 70B models, two A100 GPUs are enough to manage inference within a 16k context length. (A loading sketch follows the table.)
Software Dependencies | No | The PyTorch-style pseudocode for integrating DCA with Flash Attention 2 (Dao, 2023) can be found in Algorithm 1. The paper mentions software such as PyTorch and Flash Attention 2 but does not provide specific version numbers for the key dependencies (e.g., the PyTorch version is not specified). (A version-logging sketch follows the table.)
Experiment Setup | Yes | The chunk size s can typically be set to 3/4 of the training length; for Llama2 this value is 3072. ... We further finetune Llama2 for over 16k steps with a batch size of 1. (The chunk-size arithmetic is worked out after the table.)
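
The Pseudocode row refers to the authors' Algorithm 1, which integrates DCA with Flash Attention 2; that algorithm is not reproduced here. The following is a minimal sketch of only the intra-chunk component, under the assumption that each chunk attends causally to itself so relative positions never exceed the chunk size. It uses PyTorch's scaled_dot_product_attention, which can dispatch to FlashAttention-2 kernels on supported hardware; DCA's inter-chunk and successive-chunk attention and its position-index remapping are omitted, so this is not the paper's full method.

```python
import torch
import torch.nn.functional as F

def intra_chunk_attention(q, k, v, chunk_size):
    """Causal attention restricted to non-overlapping chunks (sketch only).

    q, k, v: (batch, heads, seq_len, head_dim); seq_len is assumed to be
    divisible by chunk_size. Only the intra-chunk part of DCA is shown;
    the paper's Algorithm 1 additionally handles inter-chunk and
    successive-chunk attention with remapped position indices.
    """
    b, h, n, d = q.shape
    n_chunks = n // chunk_size
    # Fold the sequence dimension into (n_chunks, chunk_size) so that each
    # chunk becomes an independent causal attention problem.
    q = q.view(b, h, n_chunks, chunk_size, d)
    k = k.view(b, h, n_chunks, chunk_size, d)
    v = v.view(b, h, n_chunks, chunk_size, d)
    # scaled_dot_product_attention can dispatch to FlashAttention-2 kernels
    # for fp16/bf16 inputs on supported GPUs (PyTorch >= 2.0).
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.reshape(b, h, n, d)
```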
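
The Hardware Specification row (one A100-80G for 7B/13B, two for 70B) is consistent with loading the checkpoints in half precision, since fp16 weights take roughly 2 bytes per parameter (about 14 GB for 7B and about 140 GB for 70B, before the KV cache). Below is a sketch of such a setup with Hugging Face Transformers and Accelerate; the model name is an assumption, and the DCA patch from the ChunkLlama repository is not shown.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical base checkpoint; the DCA patch from
# https://github.com/HKUNLP/ChunkLlama would have to be applied to the
# attention layers before long-context inference (not shown here).
model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,                # ~2 bytes per parameter
    device_map="auto",                        # shards across available GPUs
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```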
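
Because the Software Dependencies row notes that no version numbers are given, a reproduction has to record them itself. A minimal way to log the versions of the likely-relevant packages at runtime (the package list is an assumption):

```python
import importlib.metadata as md

# Print installed versions of the packages the paper mentions or implies.
for pkg in ("torch", "transformers", "flash-attn", "accelerate"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```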
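
The Experiment Setup row's chunk size follows from Llama 2's 4,096-token pretraining context window: three quarters of 4,096 is 3,072, matching the quoted value.

```python
pretraining_length = 4096                      # Llama 2 context window
chunk_size = int(3 / 4 * pretraining_length)   # 3/4 of the training length
assert chunk_size == 3072
```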