Language Modeling Is Compression
Authors: Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. |
| Researcher Affiliation | Collaboration | 1Google DeepMind. 2Meta AI & Inria. Correspondence to {gdelt, anianr}@google.com. |
| Pseudocode | No | The paper includes Figure 1 and Appendix A, which illustrate the arithmetic encoding process with a worked example, but this is a detailed explanation rather than structured pseudocode or an algorithm block (a minimal illustrative sketch of the coding scheme is given below the table). |
| Open Source Code | Yes | details in Appendix B and code at https://github.com/google-deepmind/language_modeling_is_compression |
| Open Datasets | Yes | enwik9: The enwik9 dataset (Hutter, 2006) consists of the first 1 000 000 000 (1 billion) bytes of the English Wikipedia XML dump on March 3rd, 2006 and is typically used to measure a model's ability to compress data. It is an extension of the enwik8 dataset, which only contains the first 100 million bytes. ImageNet: The ImageNet dataset (Russakovsky et al., 2015) contains 14 197 122 annotated images from the WordNet hierarchy. LibriSpeech: LibriSpeech (Panayotov et al., 2015) contains roughly 1000 hours of 16kHz English speech data derived from audiobooks of the LibriVox project that has been segmented and aligned. |
| Dataset Splits | No | The paper mentions training on enwik8 and evaluating on enwik9, and describes data chunking. However, it does not provide specific percentages or counts for explicit train/validation/test splits, nor does it refer to predefined splits with citations for reproducibility beyond the dataset names themselves. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud resources with specifications). |
| Software Dependencies | No | The paper mentions software like 'Python' and 'SentencePiece', and uses 'gzip' and 'LZMA2' as comparators, but it does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | For Transformers, we consider the latter approach since sliding would increase their (already very long) running time by a factor of S. Therefore, we chunk all datasets into sequences of 2048 bytes and feed them to the compressors one-by-one. We encode each neural network parameter with 2 bytes, using a float16 representation. The data fed to the large language models we use (Chinchilla and Llama 2) is an ASCII string of exactly 2048 characters. The string is transformed into a sequence of integer tokens between 0 and T, T being the vocabulary size (they both use T = 32000). We chose k = 100, which almost fully recovers the conditional distribution. (A hedged sketch of this chunked evaluation protocol is given below the table.) |
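To supplement the Pseudocode row above: the sketch below is a minimal, self-contained illustration of the paper's core idea, compressing data by feeding a predictive model's conditional probabilities into an arithmetic coder. It is not the authors' implementation (that lives in the linked repository); the exact-rational coder and the `laplace_model` byte predictor here are simplifications I am using as stand-ins for the streaming coder and LLM predictor used in the paper.

```python
# Minimal sketch: predictor-driven arithmetic coding (illustrative, not the paper's code).
from fractions import Fraction
from collections import Counter


def laplace_model(context: bytes, alphabet=range(256)):
    """Toy predictor: Laplace-smoothed byte frequencies of the context seen so far."""
    counts = Counter(context)
    total = len(context) + len(alphabet)
    return {s: Fraction(counts[s] + 1, total) for s in alphabet}


def arithmetic_encode(data: bytes, model=laplace_model) -> str:
    """Encode `data` into roughly sum_t -log2 p(x_t | x_<t) bits."""
    low, high = Fraction(0), Fraction(1)
    for t, symbol in enumerate(data):
        probs = model(data[:t])
        width, cum = high - low, Fraction(0)
        for s in sorted(probs):  # fixed symbol order, shared with a decoder
            if s == symbol:
                low, high = low + width * cum, low + width * (cum + probs[s])
                break
            cum += probs[s]
    # Emit the shortest bit string whose dyadic interval fits inside the final [low, high).
    k = 0
    while True:
        k += 1
        scale = 2 ** k
        c = Fraction(int((low + high) / 2 * scale), scale)
        if c >= low and c + Fraction(1, scale) <= high:
            return format(int(c * scale), f"0{k}b")


if __name__ == "__main__":
    payload = b"compression is prediction, prediction is compression"
    bits = arithmetic_encode(payload)
    print(f"{len(payload) * 8} raw bits -> {len(bits)} coded bits "
          f"({len(bits) / (len(payload) * 8):.1%} of raw size)")
```

A matching decoder would rerun the same model and narrow the interval symbol by symbol; the code length stays within a couple of bits of the model's log-loss on the sequence, which is why better predictors make better compressors.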
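To supplement the Experiment Setup row: a hedged sketch of the chunked evaluation, assuming the protocol quoted above (2048-byte sequences compressed one by one, with an adjusted rate that charges 2 bytes per float16 model parameter). The function names and the gzip demo are illustrative, not taken from the released codebase.

```python
# Illustrative evaluation sketch: chunked compression rates (raw and parameter-adjusted).
from typing import Callable, Iterable, Tuple

CHUNK_SIZE = 2048        # bytes per sequence fed to each compressor
BYTES_PER_PARAM = 2      # float16 encoding of each neural network parameter


def chunks(data: bytes, size: int = CHUNK_SIZE) -> Iterable[bytes]:
    """Split the raw byte stream into fixed-size sequences, as in the setup above."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def compression_rates(data: bytes,
                      compress_fn: Callable[[bytes], bytes],
                      num_params: int = 0) -> Tuple[float, float]:
    """Return (raw rate, adjusted rate) as compressed size / original size,
    where the adjusted rate also charges the model's parameters to the output."""
    compressed_size = sum(len(compress_fn(c)) for c in chunks(data))
    raw_rate = compressed_size / len(data)
    adjusted_rate = (compressed_size + num_params * BYTES_PER_PARAM) / len(data)
    return raw_rate, adjusted_rate


if __name__ == "__main__":
    import gzip
    # Placeholder data; the paper evaluates on enwik9, ImageNet patches, and LibriSpeech samples.
    data = b"To be, or not to be, that is the question. " * 25_000
    raw, adjusted = compression_rates(data, gzip.compress)
    print(f"gzip: raw rate {raw:.1%}, adjusted rate {adjusted:.1%}")
```

For parameter-free compressors like gzip the two rates coincide; for large neural models the 2-bytes-per-parameter term can dominate, which is why accounting for model size matters when comparing against baselines such as gzip, PNG, or FLAC.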