LoCoCo: Dropping In Convolutions for Long Context Compression

Authors: Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that LoCoCo maintains consistently outstanding performance across various context lengths and can achieve a high context compression rate during both inference and fine-tuning phases.
Researcher Affiliation | Collaboration | University of Texas at Austin, Meta AI (FAIR), and Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1 (Segment-level Attention, Training Time) and Algorithm 2 (LoCoCo Attention, Training Time) are presented; an illustrative sketch of the convolutional compression idea follows the table.
Open Source Code | Yes | Codes are available at: https://github.com/VITA-Group/LoCoCo
Open Datasets | Yes | We use RedPajama (Computer, 2023) as our training dataset. For post-hoc compression experiments, we only tune compression heads for 200 steps without modifying the pre-trained LLM. For context-length extension, we fine-tune the convolutional heads and LoRA adapters (rank 8), and also allow modifying the embedding and normalization layers, all following Chen et al. (2023b). We select the reading comprehension dataset RACE (Lai et al., 2017) (2, 4, 6 shots), the closed-book question answering dataset TriviaQA (Joshi et al., 2017) (50 shots), and the common sense reasoning datasets HellaSwag (Zellers et al., 2019) (10, 20, 40 shots), WinoGrande (Sakaguchi et al., 2021) (70 shots), and ARC easy and challenge (Clark et al., 2018) (40 shots).
Dataset Splits | No | The paper does not explicitly report training/validation/test split percentages or absolute sample counts, nor does it cite predefined validation splits.
Hardware Specification | Yes | All experiments are run on A6000 GPUs (48 GB memory) to intentionally test efficacy on small-memory GPUs, with a per-device batch size of 1. The measurements are conducted on an NVIDIA A100 80 GB GPU.
Software Dependencies | Yes | We use FlashAttention-2 (Dao, 2023) and DeepSpeed Stage 2 by default.
Experiment Setup | Yes | For all experiments, we use learning rates of 5 × 10⁻⁵ for the LoRA adapters, embedding, and normalization layers and 5 × 10⁻² for the convolutional heads, with a linear learning rate schedule. We use a batch size of 128 and a chunk size of 512. A configuration sketch follows the table.
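
To make the Experiment Setup row concrete, below is a minimal PyTorch sketch of an optimizer with two parameter groups at the reported learning rates (5e-5 and 5e-2) and a linear learning rate schedule. The toy model, its submodule names, and the 200-step horizon are illustrative assumptions, not the authors' actual code.

```python
import torch
from torch import nn

# Toy stand-in model: the submodule names ("lora_adapter", "conv_head") are
# hypothetical markers used only to split parameters into the two groups
# described in the Experiment Setup row; they are not the authors' module names.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lora_adapter = nn.Linear(64, 64)                         # stands in for LoRA / embedding / norm params
        self.conv_head = nn.Conv1d(64, 64, kernel_size=3, padding=1)  # stands in for convolutional compression heads

model = ToyModel()
lora_params = [p for n, p in model.named_parameters() if "lora" in n]
conv_params = [p for n, p in model.named_parameters() if "conv_head" in n]

# Two parameter groups with the two learning rates reported in the paper.
optimizer = torch.optim.AdamW([
    {"params": lora_params, "lr": 5e-5},  # LoRA adapters, embedding, normalization layers
    {"params": conv_params, "lr": 5e-2},  # convolutional compression heads
])

# Linear decay over the run, matching the stated linear learning rate schedule.
total_steps = 200  # e.g., the 200-step post-hoc compression-head tuning run (assumption for this sketch)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)
```

Keeping the convolutional heads in their own parameter group lets them train at a learning rate three orders of magnitude larger than the LoRA adapters, as the row above reports.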
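For orientation on what the convolutional compression heads referenced in the Pseudocode and Open Datasets rows do at a high level, here is a hedged sketch of the underlying idea: a learnable, strided 1D convolution that shortens the sequence axis of cached key/value states. This is not the paper's Algorithm 1 or 2; the class name `KVCompressor`, the depthwise design, and all shapes are assumptions made for illustration.

```python
import torch
from torch import nn

class KVCompressor(nn.Module):
    """Illustrative compression head: a strided depthwise 1D convolution that
    shortens the sequence axis of cached key/value states. Hypothetical sketch,
    not the exact LoCoCo head from the paper."""
    def __init__(self, head_dim: int, stride: int = 4, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(
            head_dim, head_dim, kernel_size,
            stride=stride, padding=kernel_size // 2, groups=head_dim,
        )

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: (batch, seq_len, head_dim) -> (batch, seq_len // stride, head_dim)
        return self.conv(kv.transpose(1, 2)).transpose(1, 2)

# Example: compress a 4096-token cache to roughly 1024 slots per head.
comp = KVCompressor(head_dim=128, stride=4)
keys = torch.randn(1, 4096, 128)
compressed = comp(keys)
print(compressed.shape)  # torch.Size([1, 1024, 128])
```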