PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

Authors: Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. In this section, we conduct experiments to verify the effectiveness of PoSE for context window extension.
Researcher Affiliation | Collaboration | School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University; Microsoft Corporation
Pseudocode | No | The paper includes a figure with Python code for a calculation, but no formal pseudocode or algorithm blocks describing the main methodology (PoSE). (An illustrative sketch of the skip-wise position-id sampling appears after this table.)
Open Source Code | Yes | https://github.com/dwzhu-pku/PoSE
Open Datasets | Yes | The fine-tuning dataset is sourced from The Pile (Gao et al., 2020), with a minimum length requirement of 2,048 tokens. ... We evaluate language modeling on GovReport (Huang et al., 2021) and Proof-pile (Zhangir et al., 2022) datasets. ... Given the need to evaluate on extremely long documents, we have opted to employ two book datasets, namely Books3 (Presser, 2020) and Gutenberg (PG-19) (Rae et al., 2019).
Dataset Splits | No | The paper uses datasets for fine-tuning and evaluation but does not specify explicit training/validation/test splits, percentages, or sample counts for reproducibility of data partitioning.
Hardware Specification | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). ... For evaluation, we use a single A100 GPU.
Software Dependencies | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). ... Flash Attention V2 (Dao, 2023) is applied...
Experiment Setup | Yes | This training process comprises 1,000 steps, employing a global batch size of 64 on 8 V100 GPUs using DeepSpeed ZeRO stage 3 (Rajbhandari et al., 2020). We use learning rate 2e-5 and a linear scheduler, with 10 warmup steps. We use AdamW optimizer with its default hyperparameters setup. (A hedged training-configuration sketch based on these values appears after this table.)
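
Since the paper provides no formal pseudocode for PoSE (see the Pseudocode row above), here is a minimal, hedged sketch of the core idea of positional skip-wise training for a single example: keep the attended sequence at the original training length while shifting part of its position indices by a random skipping bias, so relative positions up to the target context length are seen during fine-tuning. The two-segment split, the function name `pose_position_ids`, and the default lengths are illustrative assumptions; the authors' actual implementation (see the repository linked above) may chunk and sample differently.

```python
import torch


def pose_position_ids(train_len: int = 2048, target_len: int = 16384) -> torch.Tensor:
    """Illustrative sketch: sample skip-wise position ids for one training chunk.

    The chunk of ``train_len`` tokens is split into two segments; the second
    segment's position indices are shifted by a random skipping bias, so the
    model is exposed to relative distances spanning the full ``target_len``
    window while only ever attending over ``train_len`` tokens.
    """
    # Random split point inside the chunk (both segments stay non-empty).
    split = int(torch.randint(1, train_len, (1,)))
    # Random skipping bias; the largest position index must stay below target_len.
    skip = int(torch.randint(0, target_len - train_len + 1, (1,)))

    first = torch.arange(0, split)                  # positions 0 .. split-1
    second = torch.arange(split, train_len) + skip  # positions shifted by the skipping bias
    return torch.cat([first, second])               # shape: (train_len,)
```

At training time these ids would be passed as the `position_ids` argument of a RoPE-based decoder (e.g., a Hugging Face LLaMA model), so that fine-tuning within a 2,048-token window simulates positions across the longer target window.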
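
For the Experiment Setup row, the following is a hedged sketch of how the reported hyperparameters could be expressed with Hugging Face `TrainingArguments`. Only the step count, the global batch size of 64 across 8 GPUs, the learning rate, scheduler, warmup steps, optimizer, and DeepSpeed ZeRO stage 3 come from the paper; the per-device batch size / gradient-accumulation split, the output directory, the DeepSpeed config filename, and the use of fp16 are assumptions.

```python
from transformers import TrainingArguments

# Sketch of the reported setup; values marked "assumed" are not from the paper.
training_args = TrainingArguments(
    output_dir="pose-llama-extended",   # hypothetical output path (assumed)
    max_steps=1000,                     # 1,000 training steps
    per_device_train_batch_size=2,      # 2 x 4 accum x 8 GPUs = global batch 64 (assumed split)
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=10,
    optim="adamw_torch",                # AdamW with its default hyperparameters
    deepspeed="ds_zero3.json",          # DeepSpeed ZeRO stage 3 config file (assumed filename)
    fp16=True,                          # V100 GPUs lack bf16 support, so fp16 is assumed
)
```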