Zoology: Measuring and Improving Recall in Efficient Language Models
Authors: Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain and evaluate 17 language models across 4 scales (70M-1.4Bn) and 5 architectures on the same data and infrastructure setup. Surprisingly, we find that there is still a perplexity gap of up to 2.1 points between state-of-the-art convolution-based architectures and strong Transformer baselines in language modeling on the Pile (Table 1). Through fine-grained analysis, we find a single, simple capability is responsible for much of the gap: recalling information seen in-context. |
| Researcher Affiliation | Academia | Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré. Department of Computer Science, Stanford University. {simran, eyuboglu, poli, jamesz, chrismre}@cs.stanford.edu; {isysjohn, atri}@buffalo.edu; {atimalsi}@purdue.edu |
| Pseudocode | Yes | Algorithm 2 Projection(u, h); Algorithm 3 Hyena(u, h, hs); Algorithm 4 RWKVProjection(u, h); Algorithm 5 RWKV(u, h, hs); Algorithm 6 RetNet(u, h, hs); Algorithm 7 BaseConv(u, W, b1, h, b2). (A minimal BaseConv sketch appears after this table.) |
| Open Source Code | Yes | Code is at: https://github.com/HazyResearch/zoology. |
| Open Datasets | Yes | We pretrain a suite of large language models with different sequence mixers across 3 scales (70M-360M) for 10B tokens on the standard Pile language modeling setting using the EleutherAI GPT-NeoX training infrastructure (Gao et al., 2020; Andonian et al., 2023). |
| Dataset Splits | Yes | We stratify Pile validation data for models from each architecture class by whether or not the predicted token is a previously seen bigram in the example context. (See the stratification sketch after this table.) |
| Hardware Specification | Yes | We use A100 80GB Nvidia GPUs to run all experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'GPT2BPETokenizer' and refers to the 'EleutherAI GPT-NeoX training infrastructure', but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Below we provide details on the hyperparameters and settings for training each architecture studied in the paper on the real-world Pile data. (See Tables 7, 9, 11, 13, 14, 15, 16 for specific hyperparameters such as Optimizer, Learning rate, Global batch size, Num Layers, Hidden Size, FFN Width, etc.) |
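To make the Pseudocode row concrete, here is a minimal PyTorch sketch of a BaseConv-style layer: a linear projection of the input (parameters W, b1) gated elementwise by a long convolution over the sequence (filter h, bias b2). The class name, shapes, initialization, and FFT-based convolution are assumptions for illustration, not the authors' implementation (see the linked zoology repository for that).

```python
import torch
import torch.nn as nn


class BaseConvSketch(nn.Module):
    """Hypothetical BaseConv-style gated convolution:
    y = (u @ W + b1) * (causal_conv(u, h) + b2)."""

    def __init__(self, d_model: int, l_max: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)                          # W, b1
        self.filter = nn.Parameter(torch.randn(d_model, l_max) * 0.02)   # h
        self.conv_bias = nn.Parameter(torch.zeros(d_model))              # b2

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, seq_len, d_model)
        b, l, d = u.shape
        # Causal long convolution along the sequence via FFT (zero-padded to 2*l).
        u_f = torch.fft.rfft(u.transpose(1, 2), n=2 * l)       # (b, d, l + 1)
        h_f = torch.fft.rfft(self.filter[:, :l], n=2 * l)      # (d, l + 1)
        conv = torch.fft.irfft(u_f * h_f, n=2 * l)[..., :l]    # (b, d, l)
        conv = conv.transpose(1, 2) + self.conv_bias           # (b, l, d)
        # Elementwise gate between the projection and the convolution.
        return self.proj(u) * conv
```

The FFT path is just one standard way to realize a long convolution; a direct `nn.Conv1d` with a full-length causal kernel would express the same operator.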
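The Dataset Splits row describes stratifying validation tokens by whether the token to be predicted completes a bigram that already appeared earlier in the same context. The helper below (the function name and interface are hypothetical) sketches one way to compute that split; the authors' stratification of the Pile validation set may differ in its details.

```python
from typing import List


def previously_seen_bigram_mask(token_ids: List[int]) -> List[bool]:
    """mask[i] is True when the bigram (token_ids[i-1], token_ids[i]) already
    occurred earlier in the context, i.e. the model could in principle recall
    token i from the context rather than from its parameters."""
    seen = set()
    mask = [False] * len(token_ids)
    for i in range(1, len(token_ids)):
        bigram = (token_ids[i - 1], token_ids[i])
        if bigram in seen:
            mask[i] = True
        seen.add(bigram)
    return mask


# Example: the final token repeats the bigram (2, 3) first seen at positions 1-2.
print(previously_seen_bigram_mask([1, 2, 3, 4, 2, 3]))
# -> [False, False, False, False, False, True]
```

Per-token losses can then be averaged separately over the True and False positions to compare architectures on recalled versus non-recalled tokens.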