Zoology: Measuring and Improving Recall in Efficient Language Models

Authors: Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pretrain and evaluate 17 language models across 4 scales (70M-1.4B) and 5 architectures on the same data and infrastructure setup. Surprisingly, we find that there is still a perplexity gap of up to 2.1 points between state-of-the-art convolution-based architectures and strong Transformer baselines in language modeling on the Pile (Table 1). Through fine-grained analysis, we find a single, simple capability is responsible for much of the gap: recalling information seen in-context.
Researcher Affiliation | Academia | Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré; Department of Computer Science, Stanford University; {simran, eyuboglu, poli, jamesz, chrismre}@cs.stanford.edu, {isysjohn, atri}@buffalo.edu, {atimalsi}@purdue.edu
Pseudocode | Yes | Algorithm 2 Projection(u, h); Algorithm 3 Hyena(u, h, hs); Algorithm 4 RWKVProjection(u, h); Algorithm 5 RWKV(u, h, hs); Algorithm 6 RetNet(u, h, hs); Algorithm 7 BaseConv(u, W, b1, h, b2). (A minimal BaseConv sketch appears after this table.)
Open Source Code | Yes | Code is at: https://github.com/HazyResearch/zoology.
Open Datasets | Yes | We pretrain a suite of large language models with different sequence mixers across 3 scales (70M-360M) for 10B tokens on the standard Pile language modeling setting using the EleutherAI GPT-NeoX training infrastructure (Gao et al., 2020; Andonian et al., 2023).
Dataset Splits | Yes | We stratify Pile validation data for models from each architecture class by whether or not the predicted token completes a bigram previously seen in the example context. (An illustrative sketch of this stratification appears after this table.)
Hardware Specification | Yes | We use NVIDIA A100 80GB GPUs to run all experiments.
Software Dependencies | No | The paper mentions using 'PyTorch' and 'GPT2BPETokenizer' and refers to the 'EleutherAI GPT-NeoX training infrastructure', but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Below we provide details on the hyperparameters and settings for training each architecture studied in the paper on the real-world Pile data. (See Tables 7, 9, 11, 13, 14, 15, and 16 for specific hyperparameters such as Optimizer, Learning rate, Global batch size, Num Layers, Hidden Size, FFN Width, etc.)
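
For readers checking the Pseudocode entry, here is a minimal sketch of the BaseConv operator from Algorithm 7. It follows the paper's definition y = (uW + b1) ⊙ (h ∗ u + b2), i.e. a learned linear projection gating a learned long convolution over the sequence; the FFT-based causal convolution, the filter parameterization, and the class and parameter names below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BaseConvSketch(nn.Module):
    """Sketch of BaseConv (Algorithm 7): (u @ W + b1) * (h conv u + b2)."""

    def __init__(self, d_model: int, max_seq_len: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)                 # u @ W + b1
        self.filter = nn.Parameter(0.02 * torch.randn(max_seq_len, d_model))  # h
        self.conv_bias = nn.Parameter(torch.zeros(d_model))     # b2

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, seq_len, d_model)
        batch, seq_len, d_model = u.shape
        # Causal long convolution along the sequence, computed via FFT
        # with zero-padding to 2 * seq_len (a common implementation choice).
        u_f = torch.fft.rfft(u, n=2 * seq_len, dim=1)
        h_f = torch.fft.rfft(self.filter[:seq_len], n=2 * seq_len, dim=0)
        conv = torch.fft.irfft(u_f * h_f.unsqueeze(0), n=2 * seq_len, dim=1)[:, :seq_len]
        # Elementwise gate: linear projection times (convolution + bias).
        return self.proj(u) * (conv + self.conv_bias)

# Usage: a single BaseConv layer applied to a random batch.
layer = BaseConvSketch(d_model=64, max_seq_len=128)
y = layer(torch.randn(2, 128, 64))   # -> shape (2, 128, 64)
```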
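
The Dataset Splits row refers to the paper's stratification of Pile validation tokens by whether the target completes a bigram already seen in-context. A short illustrative helper for that split is sketched below; the function name and list-based interface are assumptions, not the authors' evaluation code.

```python
from typing import List

def previously_seen_bigram_mask(token_ids: List[int]) -> List[bool]:
    """Mark position i as True when the bigram (token_ids[i-1], token_ids[i])
    already occurred earlier in the same context, i.e. the target token could
    be predicted by recalling the earlier completion of that bigram."""
    seen = set()                       # bigrams observed so far in this context
    mask = [False] * len(token_ids)
    for i in range(1, len(token_ids)):
        bigram = (token_ids[i - 1], token_ids[i])
        if bigram in seen:
            mask[i] = True
        seen.add(bigram)
    return mask

# The bigram (5, 7) repeats, so its second completion is flagged.
print(previously_seen_bigram_mask([5, 7, 9, 5, 7]))
# [False, False, False, False, True]
```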