Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

Authors: Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate the proposed method's superior performance and memory efficiency, enabling the broader use of LLMs in contexts requiring extended context. ... Through extensive evaluation on downstream tasks and perplexity measurements, we demonstrate that HOMER can effectively extend pre-trained LLMs to handle long inputs beyond their context limits.
Researcher Affiliation | Collaboration | Woomin Song (KAIST), Seunghyuk Oh (KAIST), Sangwoo Mo (University of Michigan), Jaehyung Kim (Carnegie Mellon University), Sukmin Yun (Hanyang University ERICA), Jung-Woo Ha (NAVER), Jinwoo Shin (KAIST)
Pseudocode | Yes | Algorithm 1: Memory-efficient computation ordering. (A hedged sketch of this ordering appears after the table.)
Open Source Code | Yes | Code is available at https://github.com/alinlab/HOMER.
Open Datasets | Yes | We select Llama-2 as our base model... To this end, we sample 25 long documents from the PG-19 dataset (Rae et al., 2019)... To this end, we measure the model's performance on the validation set of QuALITY (Pang et al., 2021). (A sketch of loading the public datasets follows the table.)
Dataset Splits | Yes | We measure the model's performance on the validation set of QuALITY (Pang et al., 2021). ... Calibration is performed using 100 text corpora segments from the validation set and the test set of WikiText-103 (Merity et al., 2016).
Hardware Specification | Yes | All efficiency measurements are done with a single A100 GPU. ... All measurements are taken on a single A100 GPU, with Flash Attention 2 (Dao, 2023) applied. (A sketch of a typical single-GPU memory measurement follows the table.)
Software Dependencies | No | The paper mentions 'Flash Attention 2 (Dao, 2023)' and 'Llama-2' but does not specify version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We select Llama-2 as our base model... In all experiments involving HOMER, the maximum chunk length was set to be half of the context limit. We assign 12 additional layers for 7b models and 20 layers for 13b models. Calibration is performed using 100 text corpora segments from the validation set and the test set of WikiText-103 (Merity et al., 2016). (A hypothetical configuration capturing these values is sketched below.)
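
The Pseudocode row refers to the paper's Algorithm 1, a memory-efficient computation ordering for hierarchical merging. As a rough illustration of the idea only: the sketch below merges encoded chunks depth-first, so at most one partial result per tree level is alive at once (O(log n) live nodes) instead of materializing an entire level of the merge tree. `encode_chunk`, `merge`, and the naive token-reduction rule are illustrative placeholders, not the HOMER implementation.

from typing import List

def encode_chunk(chunk: List[int]) -> List[int]:
    """Placeholder: run one chunk through the early transformer layers."""
    return chunk

def merge(left: List[int], right: List[int], keep: int) -> List[int]:
    """Placeholder merge with token reduction: concatenate two chunks,
    then keep only `keep` tokens (here, naively the first ones)."""
    return (left + right)[:keep]

def hierarchical_merge(chunks: List[List[int]], keep: int) -> List[int]:
    """Depth-first merge order: combine two siblings as soon as both are
    ready, keeping O(log n) intermediate results in memory at any time."""
    stack: List[tuple] = []  # (tree depth, encoded chunk)
    for chunk in chunks:
        node = (0, encode_chunk(chunk))
        # Whenever the top of the stack is a same-depth sibling, merge
        # immediately rather than waiting for the whole level.
        while stack and stack[-1][0] == node[0]:
            depth, left = stack.pop()
            node = (depth + 1, merge(left, node[1], keep))
        stack.append(node)
    # Fold leftovers when len(chunks) is not a power of two.
    result = stack.pop()[1]
    while stack:
        result = merge(stack.pop()[1], result, keep)
    return result

# Example: eight 4-token chunks reduced to a single 4-token context.
print(hierarchical_merge([[i] * 4 for i in range(8)], keep=4))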
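
The dataset rows cite PG-19, QuALITY, and WikiText-103, all publicly available. A minimal sketch of fetching the PG-19 and WikiText-103 portions with the Hugging Face datasets library follows; the Hub id "pg19", the split choice for the 25 sampled documents, and the use of this library at all are assumptions, since the paper does not describe its data-loading code (QuALITY is distributed separately by its authors).

from itertools import islice
from datasets import load_dataset

# PG-19 for perplexity: stream the corpus and take 25 long documents.
# (Hub id and split are assumptions; the paper only says 25 documents
# were sampled.)
pg19 = load_dataset("pg19", split="test", streaming=True)
long_docs = [example["text"] for example in islice(pg19, 25)]

# WikiText-103 validation and test sets, from which the paper draws
# 100 calibration segments.
wiki_val = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
wiki_test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")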
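
The efficiency numbers are reported on a single A100 GPU with Flash Attention 2. The paper does not include its measurement script, but a common way to take such a measurement with Hugging Face transformers and PyTorch looks roughly like this (the model id and prompt are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="cuda",
)

# Reset the peak-memory counter, run one forward pass, and read it back.
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("A very long document ...", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**inputs)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")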
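
Finally, the Experiment Setup row pins down a few concrete hyperparameters. A hypothetical configuration object capturing them is sketched below; the field names are invented for illustration and the HOMER repository's actual config schema may differ.

from dataclasses import dataclass

@dataclass
class HomerSetup:
    """Invented field names; values follow the quoted setup."""
    context_limit: int = 4096        # Llama-2's pre-trained context window
    max_chunk_len: int = 4096 // 2   # max chunk length = half the context limit
    extra_layers_7b: int = 12        # additional layers assigned for 7b models
    extra_layers_13b: int = 20       # additional layers assigned for 13b models
    calibration_segments: int = 100  # WikiText-103 validation + test segments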