Language Modeling Is Compression

Authors: Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
Researcher Affiliation | Collaboration | Google DeepMind; Meta AI & Inria. Correspondence to {gdelt, anianr}@google.com.
Pseudocode | No | The paper includes Figure 1 and Appendix A, which illustrate the arithmetic encoding process with a worked example, but this is a detailed explanation rather than structured pseudocode or an algorithm block. (An illustrative sketch of the idea follows this table.)
Open Source Code | Yes | Details in Appendix B; code at https://github.com/google-deepmind/language_modeling_is_compression.
Open Datasets | Yes | enwik9: the enwik9 dataset (Hutter, 2006) consists of the first 1,000,000,000 (1 billion) bytes of the English Wikipedia XML dump of March 3rd, 2006 and is typically used to measure a model's ability to compress data; it is an extension of the enwik8 dataset, which contains only the first 100 million bytes. ImageNet: the ImageNet dataset (Russakovsky et al., 2015) contains 14,197,122 annotated images from the WordNet hierarchy. LibriSpeech: LibriSpeech (Panayotov et al., 2015) contains roughly 1000 hours of 16 kHz English speech data derived from audiobooks of the LibriVox project, segmented and aligned.
Dataset Splits | No | The paper mentions training on enwik8 and evaluating on enwik9, and describes how the data is chunked. However, it provides no explicit train/validation/test splits (percentages or counts), nor does it point to predefined, citable splits beyond the dataset names themselves.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud resources with specifications).
Software Dependencies | No | The paper mentions software such as Python and SentencePiece and uses gzip and LZMA2 as baseline compressors, but it does not provide version numbers for any of these components or libraries.
Experiment Setup | Yes | For Transformers, we consider the latter approach, since sliding would increase their (already very long) running time by a factor of S. Therefore, we chunk all datasets into sequences of 2048 bytes and feed them to the compressors one by one. We encode each neural network parameter with 2 bytes, using a float16 representation. The data fed to the large language models we use (Chinchilla and Llama 2) is an ASCII string of exactly 2048 characters. The string is transformed into a sequence of integer tokens between 0 and T, T being the vocabulary size (both use T = 32000). We chose k = 100, which almost fully recovers the conditional distribution. (A sketch of the chunked compression-rate computation also follows this table.)
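
The paper's central construction, illustrated in its Figure 1 and Appendix A, is lossless compression by arithmetic coding driven by a predictive model. Below is a minimal Python sketch of that idea using exact rational arithmetic; the adaptive byte-frequency predictor is a hypothetical stand-in for the language models the paper actually uses (Chinchilla, Llama 2), and the code is not the authors' released implementation. Any model that returns a conditional distribution over the next byte given the context can be plugged in, and the resulting code length stays within a couple of bits of the model's negative log-likelihood.

```python
# Minimal sketch: compression from a predictive model via arithmetic coding,
# using exact rational arithmetic for clarity (not speed). The adaptive
# byte-frequency model is a hypothetical stand-in for an LLM's conditional
# distribution p(next byte | context); this is NOT the paper's code.
from fractions import Fraction

ALPHABET = 256  # we model raw bytes


def adaptive_model(context: bytes) -> list[Fraction]:
    """Toy predictor: Laplace-smoothed byte frequencies of the context."""
    counts = [1] * ALPHABET
    for b in context:
        counts[b] += 1
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


def encode(data: bytes, model=adaptive_model) -> list[int]:
    """Maps `data` to a bit sequence whose length is close to the model's
    negative log-likelihood of the data, in bits."""
    low, width = Fraction(0), Fraction(1)
    for i, sym in enumerate(data):
        probs = model(data[:i])
        low += width * sum(probs[:sym])   # shrink to the symbol's sub-interval
        width *= probs[sym]
    # Emit the binary expansion of the interval midpoint until the truncated
    # value itself lies inside [low, low + width); that prefix is the code.
    rest = low + width / 2
    bits, value, place = [], Fraction(0), Fraction(1, 2)
    while not bits or value < low:
        rest *= 2
        bit = int(rest >= 1)
        rest -= bit
        value += bit * place
        place /= 2
        bits.append(bit)
    return bits


def decode(bits: list[int], num_symbols: int, model=adaptive_model) -> bytes:
    """Recovers the data by re-running the same predictive model."""
    z = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))
    out = bytearray()
    low, width = Fraction(0), Fraction(1)
    for _ in range(num_symbols):
        probs = model(bytes(out))
        cum = Fraction(0)
        for sym, p in enumerate(probs):   # find the sub-interval containing z
            if low + width * (cum + p) > z:
                break
            cum += p
        out.append(sym)
        low += width * cum
        width *= probs[sym]
    return bytes(out)


if __name__ == "__main__":
    msg = b"language modeling is compression, " * 4
    code = encode(msg)
    assert decode(code, len(msg)) == msg
    print(f"{8 * len(msg)} raw bits -> {len(code)} code bits "
          f"({len(code) / (8 * len(msg)):.1%} of raw size)")
```

The better the predictor, the shorter the code: replacing the toy frequency model with a strong language model is exactly what turns prediction quality into the compression rates quoted in the abstract.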
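
The Experiment Setup row can likewise be condensed into a short sketch: data is split into 2048-byte chunks, each chunk is compressed independently, and the adjusted rate additionally charges 2 bytes (float16) per model parameter. The function and the gzip demo below are illustrative assumptions rather than the repository's API; `code_length_bits` is a hypothetical placeholder for any compressor (an LLM-driven arithmetic coder like the one sketched above, or a classical baseline such as gzip).

```python
# Sketch of the chunked evaluation described in the Experiment Setup row.
# `code_length_bits` is an assumed helper returning a compressor's output
# size in bits for one chunk; it is not part of the released code.
import gzip
from typing import Callable

CHUNK_SIZE = 2048  # bytes per sequence, as stated in the paper


def compression_rates(data: bytes,
                      code_length_bits: Callable[[bytes], float],
                      num_parameters: int = 0) -> tuple[float, float]:
    """Returns (raw rate, adjusted rate) as fractions of the original size.
    The adjusted rate also counts the model at 2 bytes (float16) per
    parameter, following the paper's accounting."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    compressed_bytes = sum(code_length_bits(chunk) for chunk in chunks) / 8
    raw_rate = compressed_bytes / len(data)
    adjusted_rate = (compressed_bytes + 2 * num_parameters) / len(data)
    return raw_rate, adjusted_rate


if __name__ == "__main__":
    sample = b"wikipedia " * 1024   # stand-in payload; the paper uses enwik9 etc.
    raw, adjusted = compression_rates(
        sample,
        code_length_bits=lambda chunk: 8 * len(gzip.compress(chunk)),
        num_parameters=0,           # classical compressors carry no learned parameters
    )
    print(f"gzip, per-chunk: {raw:.1%} of raw size (adjusted: {adjusted:.1%})")
```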