PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training
Authors: Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. In this section, we conduct experiments to verify the effectiveness of PoSE for context window extension. |
| Researcher Affiliation | Collaboration | School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University; Microsoft Corporation |
| Pseudocode | No | The paper includes a figure with Python code for a calculation, but no formal pseudocode or algorithm blocks describing the main methodology (PoSE). An illustrative sketch of the skip-wise position indices is given below the table. |
| Open Source Code | Yes | https://github.com/dwzhu-pku/PoSE |
| Open Datasets | Yes | The fine-tuning dataset is sourced from The Pile (Gao et al., 2020), with a minimum length requirement of 2,048 tokens. ... We evaluate language modeling on GovReport (Huang et al., 2021) and Proof-pile (Zhangir et al., 2022) datasets. ... Given the need to evaluate on extremely long documents, we have opted to employ two book datasets, namely Books3 (Presser, 2020) and Gutenberg (PG-19) (Rae et al., 2019). |
| Dataset Splits | No | The paper uses datasets for fine-tuning and evaluation but does not specify explicit training/validation/test splits, percentages, or sample counts for reproducibility of data partitioning. |
| Hardware Specification | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). ... For evaluation, we use a single A100 GPU. |
| Software Dependencies | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). ... FlashAttention V2 (Dao, 2023) is applied... |
| Experiment Setup | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). We use learning rate 2e-5 and a linear scheduler, with 10 warmup steps. We use the AdamW optimizer with its default hyperparameter setup. A hedged configuration sketch follows below the table. |
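
For the missing pseudocode noted above, here is a minimal sketch of the skip-wise position-index manipulation that PoSE describes, written for illustration only: the chunk count, the uniform sampling of skip offsets, keeping the first chunk unshifted, and the function name `skipwise_position_ids` are our assumptions, not the authors' reference implementation (see their repository linked above for the real code).

```python
import random

def skipwise_position_ids(original_len=2048, target_len=16384, num_chunks=2):
    """Build position indices for an `original_len`-token training window whose
    values are spread, via random skips, across the target window [0, target_len)."""
    chunk_len = original_len // num_chunks
    max_skip = target_len - original_len  # total head-room available for skipping
    # Draw non-decreasing skip offsets, one per chunk. Keeping the first chunk
    # unshifted and sampling uniformly here are illustrative assumptions; the
    # paper defines its own sampling of the skipping bias terms.
    skips = sorted(random.randint(0, max_skip) for _ in range(num_chunks))
    skips[0] = 0
    position_ids = []
    for i in range(num_chunks):
        start = i * chunk_len + skips[i]
        position_ids.extend(range(start, start + chunk_len))
    return position_ids

# Example: a 2,048-token window "pretends" to span positions up to 16,383.
ids = skipwise_position_ids()
print(len(ids), min(ids), max(ids))  # 2048, 0, at most 16383
```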
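
The reported fine-tuning setup can likewise be summarized as a hedged configuration sketch using Hugging Face `TrainingArguments`. Only the quoted hyperparameters (1,000 steps, global batch size 64, learning rate 2e-5, linear scheduler with 10 warmup steps, AdamW, DeepSpeed ZeRO stage 3) come from the paper; the output directory, per-device batch split, and DeepSpeed config path are placeholders.

```python
from transformers import TrainingArguments

# Illustrative only: output_dir, the per-device batch split, and the DeepSpeed
# config path are placeholder assumptions; the numeric values are quoted above.
training_args = TrainingArguments(
    output_dir="pose-2k-to-16k",       # placeholder output directory
    max_steps=1000,                    # 1,000 training steps
    per_device_train_batch_size=8,     # 8 GPUs x 8 per device = global batch 64 (assumed split)
    learning_rate=2e-5,                # learning rate 2e-5
    lr_scheduler_type="linear",        # linear scheduler
    warmup_steps=10,                   # 10 warmup steps
    optim="adamw_torch",               # AdamW with default hyperparameters
    deepspeed="ds_zero3_config.json",  # DeepSpeed ZeRO stage 3 (hypothetical config file)
)
```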