reproducibilityindex.ai

Zoology: Measuring and Improving Recall in Efficient Language Models

Authors: Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We pretrain and evaluate 17 language models across 4 scales (70M-1.4Bn) and 5 architectures on the same data and infrastructure setup. Surprisingly, we find that there is still a perplexity gap of up to 2.1 points between state-of-the-art convolution-based architectures and strong Transformer baselines in language modeling on the Pile (Table 1). Through fine-grained analysis, we find a single, simple capability is responsible for much of the gap: recalling information seen in-context.
Researcher Affiliation	Academia	Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré. Department of Computer Science, Stanford University. {simran, eyuboglu, poli, jamesz, chrismre}@cs.stanford.edu, {isysjohn, atri}@buffalo.edu, {atimalsi}@purdue.edu
Pseudocode	Yes	Algorithm 2 Projection (u, h); Algorithm 3 Hyena (u, h, hs); Algorithm 4 RWKVProjection (u, h); Algorithm 5 RWKV (u, h, hs); Algorithm 6 RetNet (u, h, hs); Algorithm 7 BASECONV (u, W, b1, h, b2)
Open Source Code	Yes	Code is at: https://github.com/HazyResearch/zoology.
Open Datasets	Yes	We pretrain a suite of large language models with different sequence mixers across 3 scales (70M-360M) for 10B tokens on the standard Pile language modeling setting using the EleutherAI GPT-NeoX training infrastructure (Gao et al., 2020; Andonian et al., 2023).
Dataset Splits	Yes	We stratify Pile validation data for models from each architecture class by whether or not the predicted token is a previously seen bigram in the example context.
Hardware Specification	Yes	We use A100 80GB Nvidia GPUs to run all experiments.
Software Dependencies	No	The paper mentions using 'PyTorch' and 'GPT2BPETokenizer' and refers to the 'EleutherAI GPT-NeoX training infrastructure', but does not provide specific version numbers for any software dependencies.
Experiment Setup	Yes	Below we provide details on the hyperparameters and settings for training each architecture studied in the paper on the real-world Pile data. (See Tables 7, 9, 11, 13, 14, 15, 16 for specific hyperparameters such as Optimizer, Learning rate, Global batch size, Num Layers, Hidden Size, FFN Width, etc.)
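
The stratification quoted under Dataset Splits (whether the predicted token completes a bigram already seen in the example context) can be illustrated with a minimal sketch. This is not the authors' code: the function names `repeated_bigram_mask` and `stratified_perplexity` are hypothetical, token IDs are assumed to be plain integers, and per-token losses are assumed to be negative log-likelihoods in nats. The paper's exact definition may apply additional filtering that is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def repeated_bigram_mask(tokens: Sequence[int]) -> List[bool]:
    """mask[i] is True when the bigram (tokens[i-1], tokens[i]) already
    occurred earlier in the same sequence (hypothetical helper)."""
    seen = set()
    mask = [False] * len(tokens)
    for i in range(1, len(tokens)):
        bigram = (tokens[i - 1], tokens[i])
        mask[i] = bigram in seen
        seen.add(bigram)
    return mask


def stratified_perplexity(
    tokens: Sequence[int], nll: Sequence[float]
) -> Tuple[float, float]:
    """Return (perplexity on repeated-bigram positions, perplexity elsewhere).

    nll[i] is assumed to be the model's negative log-likelihood (nats) of
    tokens[i]. Position 0 is skipped because it has no preceding token.
    An empty slice yields NaN.
    """
    mask = repeated_bigram_mask(tokens)

    def ppl(values: List[float]) -> float:
        return math.exp(sum(values) / len(values)) if values else float("nan")

    repeated = [nll[i] for i in range(1, len(tokens)) if mask[i]]
    other = [nll[i] for i in range(1, len(tokens)) if not mask[i]]
    return ppl(repeated), ppl(other)


# Toy example: the bigram (5, 7) recurs at positions 4 and 7.
example_tokens = [5, 7, 9, 5, 7, 2, 5, 7]
example_mask = repeated_bigram_mask(example_tokens)
```

Comparing the two perplexities across architectures is one way to see how much of a gap comes from positions that require recalling earlier context, under the assumptions stated above.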