Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

Authors: Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate the proposed method's superior performance and memory efficiency, enabling the broader use of LLMs in contexts requiring extended context. ... Through extensive evaluation on downstream tasks and perplexity measurements, we demonstrate that HOMER can effectively extend pre-trained LLMs to handle long inputs beyond their context limits.
Researcher Affiliation | Collaboration | Woomin Song (KAIST), Seunghyuk Oh (KAIST), Sangwoo Mo (University of Michigan), Jaehyung Kim (Carnegie Mellon University), Sukmin Yun (Hanyang University ERICA), Jung-Woo Ha (NAVER), Jinwoo Shin (KAIST)
Pseudocode | Yes | Algorithm 1: Memory-efficient computation ordering. (A hedged sketch of this ordering appears after the table.)
Open Source Code | Yes | Code is available at https://github.com/alinlab/HOMER.
Open Datasets | Yes | We select Llama-2 as our base model... To this end, we sample 25 long documents from the PG-19 dataset (Rae et al., 2019)... To this end, we measure the model's performance on the validation set of QuALITY (Pang et al., 2021). (A sketch of loading the public datasets follows the table.)
Dataset Splits | Yes | We measure the model's performance on the validation set of QuALITY (Pang et al., 2021). ... Calibration is performed using 100 text corpora segments from the validation set and the test set of WikiText-103 (Merity et al., 2016).
Hardware Specification | Yes | All efficiency measurements are done with a single A100 GPU. ... All measurements are taken on a single A100 GPU, with Flash Attention 2 (Dao, 2023) applied. (A sketch of a typical single-GPU memory measurement follows the table.)
Software Dependencies | No | The paper mentions 'Flash Attention 2 (Dao, 2023)' and 'Llama-2' but does not specify version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We select Llama-2 as our base model... In all experiments involving HOMER, the maximum chunk length was set to be half of the context limit. We assign 12 additional layers for 7b models and 20 layers for 13b models. Calibration is performed using 100 text corpora segments from the validation set and the test set of WikiText-103 (Merity et al., 2016). (A hypothetical configuration capturing these values is sketched below.)
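
The Pseudocode row refers to the paper's Algorithm 1, a memory-efficient computation ordering for hierarchical merging. As a rough illustration of the idea only: the sketch below merges encoded chunks depth-first, so at most one partial result per tree level is alive at once (O(log n) live nodes) instead of materializing an entire level of the merge tree. `encode_chunk`, `merge`, and the naive token-reduction rule are illustrative placeholders, not the HOMER implementation.

from typing import List

def encode_chunk(chunk: List[int]) -> List[int]:
    """Placeholder: run one chunk through the early transformer layers."""
    return chunk

def merge(left: List[int], right: List[int], keep: int) -> List[int]:
    """Placeholder merge with token reduction: concatenate two chunks,
    then keep only `keep` tokens (here, naively the first ones)."""
    return (left + right)[:keep]

def hierarchical_merge(chunks: List[List[int]], keep: int) -> List[int]:
    """Depth-first merge order: combine two siblings as soon as both are
    ready, keeping O(log n) intermediate results in memory at any time."""
    stack: List[tuple] = []  # (tree depth, encoded chunk)
    for chunk in chunks:
        node = (0, encode_chunk(chunk))
        # Whenever the top of the stack is a same-depth sibling, merge
        # immediately rather than waiting for the whole level.
        while stack and stack[-1][0] == node[0]:
            depth, left = stack.pop()
            node = (depth + 1, merge(left, node[1], keep))
        stack.append(node)
    # Fold leftovers when len(chunks) is not a power of two.
    result = stack.pop()[1]
    while stack:
        result = merge(stack.pop()[1], result, keep)
    return result

# Example: eight 4-token chunks reduced to a single 4-token context.
print(hierarchical_merge([[i] * 4 for i in range(8)], keep=4))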
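
The dataset rows cite PG-19, QuALITY, and WikiText-103, all publicly available. A minimal sketch of fetching the PG-19 and WikiText-103 portions with the Hugging Face datasets library follows; the Hub id "pg19", the split choice for the 25 sampled documents, and the use of this library at all are assumptions, since the paper does not describe its data-loading code (QuALITY is distributed separately by its authors).

from itertools import islice
from datasets import load_dataset

# PG-19 for perplexity: stream the corpus and take 25 long documents.
# (Hub id and split are assumptions; the paper only says 25 documents
# were sampled.)
pg19 = load_dataset("pg19", split="test", streaming=True)
long_docs = [example["text"] for example in islice(pg19, 25)]

# WikiText-103 validation and test sets, from which the paper draws
# 100 calibration segments.
wiki_val = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
wiki_test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")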
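
The efficiency numbers are reported on a single A100 GPU with Flash Attention 2. The paper does not include its measurement script, but a common way to take such a measurement with Hugging Face transformers and PyTorch looks roughly like this (the model id and prompt are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="cuda",
)

# Reset the peak-memory counter, run one forward pass, and read it back.
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("A very long document ...", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**inputs)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")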
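
Finally, the Experiment Setup row pins down a few concrete hyperparameters. A hypothetical configuration object capturing them is sketched below; the field names are invented for illustration and the HOMER repository's actual config schema may differ.

from dataclasses import dataclass

@dataclass
class HomerSetup:
    """Invented field names; values follow the quoted setup."""
    context_limit: int = 4096        # Llama-2's pre-trained context window
    max_chunk_len: int = 4096 // 2   # max chunk length = half the context limit
    extra_layers_7b: int = 12        # additional layers assigned for 7b models
    extra_layers_13b: int = 20       # additional layers assigned for 13b models
    calibration_segments: int = 100  # WikiText-103 validation + test segments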