Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LoLCATs: On Low-Rank Linearizing of Large Language Models
Authors: Michael Zhang, Simran Arora, Rahul Chalamala, Benjamin Spector, Alan Wu, Krithik Ramesh, Aaryan Singhal, Christopher Ré
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we validate that LOLCATS improves on each of our desired criteria. On quality, when linearizing popular LLMs such as Mistral-7B and Llama 3 8B, LOLCATS substantially improves past linearizing methods (by 1.1–8.6 points (pts) on zero-shot LM Eval tasks; +17.2 pts on 5-shot MMLU). |
| Researcher Affiliation | Collaboration | Stanford University, Together AI, California Institute of Technology, MIT |
| Pseudocode | Yes | We summarize LOLCATS with Alg. 1, 2, providing pseudocode in App. C.1. |
| Open Source Code | Yes | Our code is also available at https://github.com/HazyResearch/lolcats |
| Open Datasets | Yes | We use the 50K samples of a cleaned Alpaca dataset [2], due to its ability to improve general instruction-following in 7B LLMs despite its relatively small size (Taori et al., 2023). [2: https://huggingface.co/datasets/yahma/alpaca-cleaned] ... a subset [4] of RedPajama (Computer, 2023). [4: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample] |
| Dataset Splits | Yes | We evaluate their validation set perplexity (Table 2, Fig. 3) and downstream LM Eval zero-shot quality (Table 4). We use the same data for both stages, early stopping... We evaluate the best checkpoints based on validation set perplexity. |
| Hardware Specification | Yes | generating 4096 tokens on an 80GB H100... This also only takes 40M tokens, i.e., 0.003% and 0.04% of prior pretraining and linearizing methods' token counts. On scalability, with LOLCATS we scale up linearizing to support Llama 3.1 70B and 405B LLMs (Dubey et al., 2024). LOLCATS presents the first viable approach to linearizing larger LLMs. We create the first linearized 70B LLM, taking only 18 hours on one 8×80GB H100 node, and the first linearized 405B LLM with a combination of 5 hours on 14 80GB H100 GPUs (attention transfer) + 16 hours on three 8×80GB H100 nodes (LoRA finetuning) for Llama 3.1 405B. For both models, this amounts to under half the total GPU hours that prior methods reported to linearize 8B models (5 days on 8×80GB A100s) (Wang et al., 2024). |
| Software Dependencies | No | Explanation: The paper mentions several software tools and libraries like Hugging Face Transformers, FlashAttention-2, PyTorch FSDP, and the ThunderKittens CUDA framework, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | Hyperparameters. We list all model and training hyperparameters in Table 7. For learning rates, we did an initial sweep over {1e-2, 1e-3, 1e-4}, choosing the best based on final validation set perplexity during step 2: low-rank adjusting, and checkpointing with early stopping. We did not tune batch size or choice of optimizer, and used default values informed by prior work for other design parameters such as sliding window size (Arora et al., 2024), LoRA rank, and LoRA projection layers (Hu et al., 2021). |
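The selection procedure quoted above (sweep a small learning-rate grid, early-stop each run, pick the run with the lowest validation perplexity) can be sketched generically. This is a minimal illustration, not the authors' code: `run_fn` is a hypothetical stand-in that returns per-epoch validation perplexities for a given learning rate, and the patience value is an assumption.

```python
def early_stopped_best_ppl(ppl_per_epoch, patience=2):
    """Lowest validation perplexity reached before `patience`
    consecutive non-improving epochs trigger an early stop."""
    best, bad = float("inf"), 0
    for ppl in ppl_per_epoch:
        if ppl < best:
            best, bad = ppl, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best

def sweep_learning_rates(lrs, run_fn, patience=2):
    """Pick the learning rate whose run attains the lowest
    early-stopped validation perplexity."""
    results = {lr: early_stopped_best_ppl(run_fn(lr), patience) for lr in lrs}
    return min(results, key=results.get), results

# Toy validation-perplexity curves for illustration only.
curves = {
    1e-2: [9.0, 8.5, 8.6, 8.7],
    1e-3: [8.0, 7.2, 7.1, 7.3],
    1e-4: [8.5, 8.2, 8.0, 7.9],
}
best_lr, results = sweep_learning_rates([1e-2, 1e-3, 1e-4], curves.get)
print(best_lr)  # 0.001
```

In a real run, `run_fn` would launch step 2 (low-rank adjusting) at the given learning rate and report validation perplexity after each checkpoint.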