PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

Authors: Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. In this section, we conduct experiments to verify the effectiveness of PoSE for context window extension.
Researcher Affiliation | Collaboration | School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University; Microsoft Corporation
Pseudocode | No | The paper includes a figure with Python code for a calculation, but no formal pseudocode or algorithm blocks describing the main methodology (PoSE). (An illustrative sketch of the skip-wise position-id sampling appears after this table.)
Open Source Code | Yes | https://github.com/dwzhu-pku/PoSE
Open Datasets | Yes | The fine-tuning dataset is sourced from The Pile (Gao et al., 2020), with a minimum length requirement of 2,048 tokens. ... We evaluate language modeling on GovReport (Huang et al., 2021) and Proof-pile (Zhangir et al., 2022) datasets. ... Given the need to evaluate on extremely long documents, we have opted to employ two book datasets, namely Books3 (Presser, 2020) and Gutenberg (PG-19) (Rae et al., 2019).
Dataset Splits | No | The paper uses datasets for fine-tuning and evaluation but does not specify explicit training/validation/test splits, percentages, or sample counts for reproducibility of data partitioning.
Hardware Specification | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). ... For evaluation, we use a single A100 GPU.
Software Dependencies | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). ... Flash Attention V2 (Dao, 2023) is applied...
Experiment Setup | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). We use learning rate 2e-5 and a linear scheduler, with 10 warmup steps. We use AdamW optimizer with its default hyperparameters setup. (A hedged training-configuration sketch based on these values appears after this table.)
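
Since the paper provides no formal pseudocode for PoSE (see the Pseudocode row above), here is a minimal, hedged sketch of the core idea of positional skip-wise training for a single example: keep the attended sequence at the original training length while shifting part of its position indices by a random skipping bias, so relative positions up to the target context length are seen during fine-tuning. The two-segment split, the function name `pose_position_ids`, and the default lengths are illustrative assumptions; the authors' actual implementation (see the repository linked above) may chunk and sample differently.

```python
import torch


def pose_position_ids(train_len: int = 2048, target_len: int = 16384) -> torch.Tensor:
    """Illustrative sketch: sample skip-wise position ids for one training chunk.

    The chunk of ``train_len`` tokens is split into two segments; the second
    segment's position indices are shifted by a random skipping bias, so the
    model is exposed to relative distances spanning the full ``target_len``
    window while only ever attending over ``train_len`` tokens.
    """
    # Random split point inside the chunk (both segments stay non-empty).
    split = int(torch.randint(1, train_len, (1,)))
    # Random skipping bias; the largest position index must stay below target_len.
    skip = int(torch.randint(0, target_len - train_len + 1, (1,)))

    first = torch.arange(0, split)                  # positions 0 .. split-1
    second = torch.arange(split, train_len) + skip  # positions shifted by the skipping bias
    return torch.cat([first, second])               # shape: (train_len,)
```

At training time these ids would be passed as the `position_ids` argument of a RoPE-based decoder (e.g., a Hugging Face LLaMA model), so that fine-tuning within a 2,048-token window simulates positions across the longer target window.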
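
For the Experiment Setup row, the following is a hedged sketch of how the reported hyperparameters could be expressed with Hugging Face `TrainingArguments`. Only the step count, the global batch size of 64 across 8 GPUs, the learning rate, scheduler, warmup steps, optimizer, and DeepSpeed ZeRO stage 3 come from the paper; the per-device batch size / gradient-accumulation split, the output directory, the DeepSpeed config filename, and the use of fp16 are assumptions.

```python
from transformers import TrainingArguments

# Sketch of the reported setup; values marked "assumed" are not from the paper.
training_args = TrainingArguments(
    output_dir="pose-llama-extended",   # hypothetical output path (assumed)
    max_steps=1000,                     # 1,000 training steps
    per_device_train_batch_size=2,      # 2 x 4 accum x 8 GPUs = global batch 64 (assumed split)
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=10,
    optim="adamw_torch",                # AdamW with its default hyperparameters
    deepspeed="ds_zero3.json",          # DeepSpeed ZeRO stage 3 config file (assumed filename)
    fp16=True,                          # V100 GPUs lack bf16 support, so fp16 is assumed
)
```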