LoCoCo: Dropping In Convolutions for Long Context Compression

Authors: Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that LoCoCo maintains consistently outstanding performance across various context lengths and can achieve a high context compression rate during both inference and fine-tuning phases.
Researcher Affiliation | Collaboration | University of Texas at Austin, Meta AI (FAIR), and Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1 (Segment-level Attention, Training Time) and Algorithm 2 (LoCoCo Attention, Training Time) are presented; an illustrative sketch of the convolutional compression idea follows the table.
Open Source Code | Yes | Codes are available at: https://github.com/VITA-Group/LoCoCo
Open Datasets | Yes | We use RedPajama (Computer, 2023) as our training dataset. For post-hoc compression experiments, we only tune compression heads for 200 steps without modifying the pre-trained LLM. For context-length extension, we fine-tune the convolutional heads and LoRA adapters (rank 8), and also allow modifying the embedding and normalization layers, all following Chen et al. (2023b). We select the reading comprehension dataset RACE (Lai et al., 2017) (2, 4, 6 shots), the closed-book question answering dataset TriviaQA (Joshi et al., 2017) (50 shots), and the common sense reasoning datasets HellaSwag (Zellers et al., 2019) (10, 20, 40 shots), WinoGrande (Sakaguchi et al., 2021) (70 shots), and ARC easy and challenge (Clark et al., 2018) (40 shots).
Dataset Splits | No | The paper does not explicitly report training/validation/test split percentages or absolute sample counts, nor does it cite predefined validation splits.
Hardware Specification | Yes | All experiments are run on A6000 GPUs (48 GB memory) to intentionally test efficacy on small-memory GPUs, with a per-device batch size of 1. The measurements are conducted on an NVIDIA A100 80 GB GPU.
Software Dependencies | Yes | We use FlashAttention-2 (Dao, 2023) and DeepSpeed Stage 2 by default.
Experiment Setup | Yes | For all experiments, we use learning rates of 5 × 10⁻⁵ for the LoRA adapters, embedding, and normalization layers and 5 × 10⁻² for the convolutional heads, with a linear learning rate schedule. We use a batch size of 128 and a chunk size of 512. A configuration sketch follows the table.
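
To make the Experiment Setup row concrete, below is a minimal PyTorch sketch of an optimizer with two parameter groups at the reported learning rates (5e-5 and 5e-2) and a linear learning rate schedule. The toy model, its submodule names, and the 200-step horizon are illustrative assumptions, not the authors' actual code.

```python
import torch
from torch import nn

# Toy stand-in model: the submodule names ("lora_adapter", "conv_head") are
# hypothetical markers used only to split parameters into the two groups
# described in the Experiment Setup row; they are not the authors' module names.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lora_adapter = nn.Linear(64, 64)                         # stands in for LoRA / embedding / norm params
        self.conv_head = nn.Conv1d(64, 64, kernel_size=3, padding=1)  # stands in for convolutional compression heads

model = ToyModel()
lora_params = [p for n, p in model.named_parameters() if "lora" in n]
conv_params = [p for n, p in model.named_parameters() if "conv_head" in n]

# Two parameter groups with the two learning rates reported in the paper.
optimizer = torch.optim.AdamW([
    {"params": lora_params, "lr": 5e-5},  # LoRA adapters, embedding, normalization layers
    {"params": conv_params, "lr": 5e-2},  # convolutional compression heads
])

# Linear decay over the run, matching the stated linear learning rate schedule.
total_steps = 200  # e.g., the 200-step post-hoc compression-head tuning run (assumption for this sketch)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)
```

Keeping the convolutional heads in their own parameter group lets them train at a learning rate three orders of magnitude larger than the LoRA adapters, as the row above reports.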
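For orientation on what the convolutional compression heads referenced in the Pseudocode and Open Datasets rows do at a high level, here is a hedged sketch of the underlying idea: a learnable, strided 1D convolution that shortens the sequence axis of cached key/value states. This is not the paper's Algorithm 1 or 2; the class name `KVCompressor`, the depthwise design, and all shapes are assumptions made for illustration.

```python
import torch
from torch import nn

class KVCompressor(nn.Module):
    """Illustrative compression head: a strided depthwise 1D convolution that
    shortens the sequence axis of cached key/value states. Hypothetical sketch,
    not the exact LoCoCo head from the paper."""
    def __init__(self, head_dim: int, stride: int = 4, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(
            head_dim, head_dim, kernel_size,
            stride=stride, padding=kernel_size // 2, groups=head_dim,
        )

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: (batch, seq_len, head_dim) -> (batch, seq_len // stride, head_dim)
        return self.conv(kv.transpose(1, 2)).transpose(1, 2)

# Example: compress a 4096-token cache to roughly 1024 slots per head.
comp = KVCompressor(head_dim=128, stride=4)
keys = torch.randn(1, 4096, 128)
compressed = comp(keys)
print(compressed.shape)  # torch.Size([1, 1024, 128])
```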