Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LoLCATs: On Low-Rank Linearizing of Large Language Models
Authors: Michael Zhang, Simran Arora, Rahul Chalamala, Benjamin Spector, Alan Wu, Krithik Ramesh, Aaryan Singhal, Christopher Ré
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we validate that LOLCATS improves on each of our desired criteria. On quality, when linearizing popular LLMs such as Mistral-7B and Llama 3 8B, LOLCATS substantially improves past linearizing methods (by 1.1–8.6 points (pts) on zero-shot LM Eval tasks; +17.2 pts on 5-shot MMLU). |
| Researcher Affiliation | Collaboration | Stanford University, Together AI, California Institute of Technology, MIT |
| Pseudocode | Yes | We summarize LOLCATS with Alg. 1, 2, providing pseudocode in App. C.1. |
| Open Source Code | Yes | Our code is also available at https://github.com/HazyResearch/lolcats |
| Open Datasets | Yes | We use the 50K samples of a cleaned Alpaca dataset [2], due to its ability to improve general instruction-following in 7B LLMs despite its relatively small size (Taori et al., 2023). [2: https://huggingface.co/datasets/yahma/alpaca-cleaned] ... a subset [4] of RedPajama (Computer, 2023). [4: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample] |
| Dataset Splits | Yes | We evaluate their validation set perplexity (Table 2, Fig. 3) and downstream LM Eval zero-shot quality (Table 4). We use the same data for both stages, early stopping... We evaluate the best checkpoints based on validation set perplexity. |
| Hardware Specification | Yes | generating 4096 tokens on an 80GB H100... This also only takes 40M tokens, i.e., 0.003% and 0.04% of prior pretraining and linearizing methods' token counts. On scalability, with LOLCATS we scale up linearizing to support Llama 3.1 70B and 405B LLMs (Dubey et al., 2024). LOLCATS presents the first viable approach to linearizing larger LLMs. We create the first linearized 70B LLM, taking only 18 hours on one 8×80GB H100 node, and the first linearized 405B LLM with a combination of 5 hours on 14 80GB H100 GPUs (attention transfer) + 16 hours on three 8×80GB H100 nodes (LoRA finetuning) for Llama 3.1 405B. For both models, this amounts to under half the total GPU hours that prior methods reported to linearize 8B models (5 days on 8×80GB A100s) (Wang et al., 2024). |
| Software Dependencies | No | Explanation: The paper mentions several software tools and libraries like Hugging Face Transformers, FlashAttention-2, PyTorch FSDP, and the ThunderKittens CUDA framework, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | Hyperparameters. We list all model and training hyperparameters in Table 7. For learning rates, we did an initial sweep over {1e-2, 1e-3, 1e-4}, choosing the best based on final validation set perplexity during step 2: low-rank adjusting, and checkpointing with early stopping. We did not tune batch size or choice of optimizer, and used default values informed by prior work for other design parameters such as sliding window size (Arora et al., 2024), LoRA rank, and LoRA projection layers (Hu et al., 2021). |
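The selection procedure quoted above (sweep a small learning-rate grid, early-stop each run, pick the run with the lowest validation perplexity) can be sketched generically. This is a minimal illustration, not the authors' code: `run_fn` is a hypothetical stand-in that returns per-epoch validation perplexities for a given learning rate, and the patience value is an assumption.

```python
def early_stopped_best_ppl(ppl_per_epoch, patience=2):
    """Lowest validation perplexity reached before `patience`
    consecutive non-improving epochs trigger an early stop."""
    best, bad = float("inf"), 0
    for ppl in ppl_per_epoch:
        if ppl < best:
            best, bad = ppl, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best

def sweep_learning_rates(lrs, run_fn, patience=2):
    """Pick the learning rate whose run attains the lowest
    early-stopped validation perplexity."""
    results = {lr: early_stopped_best_ppl(run_fn(lr), patience) for lr in lrs}
    return min(results, key=results.get), results

# Toy validation-perplexity curves for illustration only.
curves = {
    1e-2: [9.0, 8.5, 8.6, 8.7],
    1e-3: [8.0, 7.2, 7.1, 7.3],
    1e-4: [8.5, 8.2, 8.0, 7.9],
}
best_lr, results = sweep_learning_rates([1e-2, 1e-3, 1e-4], curves.get)
print(best_lr)  # 0.001
```

In a real run, `run_fn` would launch step 2 (low-rank adjusting) at the given learning rate and report validation perplexity after each checkpoint.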