Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Zoology: Measuring and Improving Recall in Efficient Language Models
Authors: Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain and evaluate 17 language models across 4 scales (70M-1.4Bn) and 5 architectures on the same data and infrastructure setup. Surprisingly, we find that there is still a perplexity gap of up to 2.1 points between state-of-the-art convolution-based architectures and strong Transformer baselines in language modeling on the Pile (Table 1). Through fine-grained analysis, we find a single, simple capability is responsible for much of the gap: recalling information seen in-context. |
| Researcher Affiliation | Academia | Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré Department of Computer Science, Stanford University Stanford University EMAIL EMAIL {atimalsi}@purdue.edu |
| Pseudocode | Yes | Algorithm 2 Projection (u, h) Algorithm 3 Hyena (u, h, hs) Algorithm 4 RWKVProjection (u, h) Algorithm 5 RWKV (u, h, hs) Algorithm 6 RetNet (u, h, hs) Algorithm 7 BASECONV (u, W, b1, h, b2) |
| Open Source Code | Yes | Code is at: https://github.com/HazyResearch/zoology. |
| Open Datasets | Yes | We pretrain a suite of large language models with different sequence mixers across 3 scales (70M-360M) for 10B tokens on the standard Pile language modeling setting using the EleutherAI GPT-NeoX training infrastructure (Gao et al., 2020; Andonian et al., 2023). |
| Dataset Splits | Yes | We stratify Pile validation data for models from each architecture class by whether or not the predicted token is a previously seen bigram in the example context. |
| Hardware Specification | Yes | We use A100 80GB Nvidia GPUs to run all experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'GPT2BPETokenizer' and refers to the 'EleutherAI GPT-NeoX training infrastructure', but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Below we provide details on the hyperparameters and settings for training each architecture studied in the paper on the real-world Pile data. (See Tables 7, 9, 11, 13, 14, 15, 16 for specific hyperparameters such as Optimizer, Learning rate, Global batch size, Num Layers, Hidden Size, FFN Width, etc.) |
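
The Dataset Splits row describes stratifying validation tokens by whether the predicted token completes a bigram already seen in the context. The following is a minimal Python sketch of one way such a split could be computed, not the authors' implementation. The function names `seen_bigram_mask` and `stratified_mean_loss` are hypothetical, and the sketch assumes that per-token losses are already available and that a token counts as a "previously seen bigram" when the pair (previous token, predicted token) occurred earlier in the same sequence.

```python
from typing import Dict, List, Optional, Sequence


def seen_bigram_mask(tokens: Sequence[int]) -> List[bool]:
    """mask[i] is True when the bigram (tokens[i-1], tokens[i])
    already occurred earlier in the same sequence (assumed definition)."""
    seen = set()
    mask = [False] * len(tokens)
    for i in range(1, len(tokens)):
        bigram = (tokens[i - 1], tokens[i])
        mask[i] = bigram in seen
        seen.add(bigram)
    return mask


def stratified_mean_loss(
    tokens: Sequence[int], losses: Sequence[float]
) -> Dict[str, Optional[float]]:
    """Average per-token loss separately for repeated-bigram positions
    and all other positions. Position 0 is skipped because it has no
    preceding token and therefore forms no bigram."""
    mask = seen_bigram_mask(tokens)
    repeated = [losses[i] for i in range(1, len(tokens)) if mask[i]]
    other = [losses[i] for i in range(1, len(tokens)) if not mask[i]]

    def mean(values: List[float]) -> Optional[float]:
        return sum(values) / len(values) if values else None

    return {"repeated_bigram": mean(repeated), "other": mean(other)}


# Hypothetical toy sequence: the bigram (5, 7) recurs at positions 4 and 6.
example_tokens = [5, 7, 9, 5, 7, 5, 7]
example_losses = [0.0, 2.0, 2.0, 2.0, 1.0, 4.0, 1.0]
result = stratified_mean_loss(example_tokens, example_losses)
```

Exponentiating each mean loss would give a per-stratum perplexity, which is the kind of quantity a recall gap between architectures could be read from under these assumptions.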