Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Authors: Rasoul Shafipour, David Harrison, Maxwell Horton, Jeff Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments with Llama3 70B, which is particularly challenging, show zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
Researcher Affiliation Collaboration Rasoul Shafipour1, David Harrison1, Maxwell Horton1, Jeff Marker1, Houman Bedayat1, Sachin Mehta2, Mohammad Rastegari3, Mahyar Najibi1, Saman Naderiparizi1 1Apple 2University of Washington 3Meta AI
Pseudocode Yes A summary of the algorithmic implementation can be found in Appendix A.3. [...] The pseudocode for this process is provided in Algorithm 1. Algorithm 1 LFSR Sequence Generation. [...] Algorithm 2 Reconstruction Process in SeedLM. [...] Algorithm 3 Seed and Coefficient Selection for a Weight Block.
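The LFSR sequence generation referenced in Algorithm 1 can be sketched in a few lines. This is a generic Fibonacci-style LFSR, not the paper's exact implementation: the register width, feedback taps, and output format here are illustrative assumptions.

```python
def lfsr_sequence(seed: int, taps: list[int], width: int, length: int) -> list[int]:
    """Generate `length` states of a Fibonacci LFSR (illustrative sketch).

    seed:  nonzero initial register state, at most `width` bits
    taps:  bit positions XORed together to form the feedback bit
    width: register width in bits
    """
    mask = (1 << width) - 1
    state = seed & mask
    assert state != 0, "an all-zero state locks the LFSR at zero"
    out = []
    for _ in range(length):
        out.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # shift left, feed the XORed bit back in, keep `width` bits
        state = ((state << 1) | feedback) & mask
    return out
```

With maximal-length taps the register cycles through all 2^width - 1 nonzero states before repeating, which is what makes a short seed a compact stand-in for a long pseudo-random sequence.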
Open Source Code No We compare our method against established compression techniques such as AWQ (Lin et al., 2024) and OmniQuant (Shao et al., 2023), using the official GitHub repositories for each baseline as of September 2024.
Open Datasets Yes To evaluate the quality of SeedLM, we measure perplexity on the WikiText-2 dataset (Merity et al., 2016) and assess accuracy across various zero-shot tasks using LM Evaluation Harness (Gao et al., 2021)1.
Dataset Splits Yes To evaluate language model performance, we measure perplexity on WikiText-2 using 166 test windows of 2048 tokens each. [...] For all compression methods, we use LM Evaluation Harness v0.4.3 and the following task versions: arc-challenge=1.0, arc-easy=1.0, hellaswag=1.0, winogrande=1.0, boolq=2.0.
Hardware Specification Yes Figure 3 shows the RTL design block diagram, with the target device being an AMD Virtex-7 FPGA XC7V585T-3 (2021).
Software Dependencies Yes For all compression methods, we use LM Evaluation Harness v0.4.3 and the following task versions: arc-challenge=1.0, arc-easy=1.0, hellaswag=1.0, winogrande=1.0, boolq=2.0.
Experiment Setup Yes The quantization scheme for the vector t plays a critical role in balancing reconstruction accuracy with bit efficiency, adhering to our bit budget constraints. We represent each element of t as a 4-bit 2's-complement integer, paired with a shared 4-bit exponent. [...] Table 1: Selected configurations of C, P, and K for M = 3 and M = 4.
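The quantization described above (4-bit two's-complement mantissas sharing one 4-bit exponent) is a block-floating-point scheme, and can be sketched as follows. The exponent selection rule and its signed clamping range here are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def quantize_shared_exponent(t: np.ndarray, mantissa_bits: int = 4, exp_bits: int = 4):
    """Quantize t to signed `mantissa_bits`-bit integers with one shared
    power-of-two exponent (block-floating-point sketch; details assumed)."""
    qmax = 2 ** (mantissa_bits - 1) - 1   # +7 for 4 bits
    qmin = -(2 ** (mantissa_bits - 1))    # -8 for 4 bits
    max_abs = float(np.max(np.abs(t)))
    if max_abs == 0.0:
        return np.zeros_like(t, dtype=np.int8), 0
    # pick the smallest exponent that lets the largest magnitude fit
    e = int(np.ceil(np.log2(max_abs / qmax)))
    # clamp to a signed 4-bit exponent range (assumption)
    e = max(min(e, 2 ** (exp_bits - 1) - 1), -(2 ** (exp_bits - 1)))
    q = np.clip(np.round(t / 2.0 ** e), qmin, qmax).astype(np.int8)
    return q, e

def dequantize(q: np.ndarray, e: int) -> np.ndarray:
    """Reconstruct approximate coefficients from mantissas and shared exponent."""
    return q.astype(np.float32) * 2.0 ** e
```

Sharing one exponent across the block keeps the per-coefficient cost at 4 bits while still covering a wide dynamic range, which is how the scheme stays within the stated bit budget.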