Training-Free Long-Context Scaling of Large Language Models

Authors: Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a comprehensive evaluation of our models on a diverse range of tasks that include language modeling, passkey retrieval, and real-world long-context applications that span question answering (Pang et al., 2022; Kočiský et al., 2018; Dasigi et al., 2021; An et al., 2023) and summarization (Zhong et al., 2021). ... 4. Experiments
Researcher Affiliation | Collaboration | Chenxin An *,1,2, Fei Huang 2, Jun Zhang, Shansan Gong 1, Xipeng Qiu 3, Chang Zhou 2, Lingpeng Kong 1 ... *Work done during internship at Alibaba Group. 1 The University of Hong Kong, 2 Alibaba Group, 3 Fudan University.
Pseudocode | Yes | The PyTorch-style pseudocode for integrating DCA with Flash Attention 2 (Dao, 2023) can be found in Algorithm 1. The explanation and complexity analysis of the code can be found in Appendix A.3. (A hedged sketch of the chunked-attention pattern follows the table.)
Open Source Code | Yes | All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
Open Datasets | Yes | We evaluate the long-sequence language modeling performance of our CHUNKLLAMA2 on the book corpus dataset PG19 (Rae et al., 2020)... The training dataset is sourced from ShareGPT and AlpacaGPT4 (Taori et al., 2023).
Dataset Splits | No | The paper mentions evaluating on the 'PG19 validation set' and discusses some training data for finetuning (e.g., '5,405 training instances'), but it does not provide explicit train/validation/test split percentages or sample counts needed to reproduce all experiments.
Hardware Specification | Yes | For the 7B/13B variants of CHUNKLLAMA2, we only need a single NVIDIA A100-80G GPU for inference. When scaling up to 70B models, two A100 GPUs are enough to manage inference within a 16k context length. (A loading sketch follows the table.)
Software Dependencies | No | The PyTorch-style pseudocode for integrating DCA with Flash Attention 2 (Dao, 2023) can be found in Algorithm 1. The paper mentions software such as PyTorch and Flash Attention 2 but does not provide specific version numbers for the key dependencies (e.g., the PyTorch version is not specified). (A version-logging sketch follows the table.)
Experiment Setup | Yes | The chunk size s can typically be set to 3/4 of the training length; for Llama2 this value is 3072. ... We further finetune Llama2 for over 16k steps with a batch size of 1. (The chunk-size arithmetic is worked out after the table.)
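
The Pseudocode row refers to the authors' Algorithm 1, which integrates DCA with Flash Attention 2; that algorithm is not reproduced here. The following is a minimal sketch of only the intra-chunk component, under the assumption that each chunk attends causally to itself so relative positions never exceed the chunk size. It uses PyTorch's scaled_dot_product_attention, which can dispatch to FlashAttention-2 kernels on supported hardware; DCA's inter-chunk and successive-chunk attention and its position-index remapping are omitted, so this is not the paper's full method.

```python
import torch
import torch.nn.functional as F

def intra_chunk_attention(q, k, v, chunk_size):
    """Causal attention restricted to non-overlapping chunks (sketch only).

    q, k, v: (batch, heads, seq_len, head_dim); seq_len is assumed to be
    divisible by chunk_size. Only the intra-chunk part of DCA is shown;
    the paper's Algorithm 1 additionally handles inter-chunk and
    successive-chunk attention with remapped position indices.
    """
    b, h, n, d = q.shape
    n_chunks = n // chunk_size
    # Fold the sequence dimension into (n_chunks, chunk_size) so that each
    # chunk becomes an independent causal attention problem.
    q = q.view(b, h, n_chunks, chunk_size, d)
    k = k.view(b, h, n_chunks, chunk_size, d)
    v = v.view(b, h, n_chunks, chunk_size, d)
    # scaled_dot_product_attention can dispatch to FlashAttention-2 kernels
    # for fp16/bf16 inputs on supported GPUs (PyTorch >= 2.0).
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.reshape(b, h, n, d)
```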
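
The Hardware Specification row (one A100-80G for 7B/13B, two for 70B) is consistent with loading the checkpoints in half precision, since fp16 weights take roughly 2 bytes per parameter (about 14 GB for 7B and about 140 GB for 70B, before the KV cache). Below is a sketch of such a setup with Hugging Face Transformers and Accelerate; the model name is an assumption, and the DCA patch from the ChunkLlama repository is not shown.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical base checkpoint; the DCA patch from
# https://github.com/HKUNLP/ChunkLlama would have to be applied to the
# attention layers before long-context inference (not shown here).
model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,                # ~2 bytes per parameter
    device_map="auto",                        # shards across available GPUs
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```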
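
Because the Software Dependencies row notes that no version numbers are given, a reproduction has to record them itself. A minimal way to log the versions of the likely-relevant packages at runtime (the package list is an assumption):

```python
import importlib.metadata as md

# Print installed versions of the packages the paper mentions or implies.
for pkg in ("torch", "transformers", "flash-attn", "accelerate"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```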
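
The Experiment Setup row's chunk size follows from Llama 2's 4,096-token pretraining context window: three quarters of 4,096 is 3,072, matching the quoted value.

```python
pretraining_length = 4096                      # Llama 2 context window
chunk_size = int(3 / 4 * pretraining_length)   # 3/4 of the training length
assert chunk_size == 3072
```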