Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

Authors: Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the speedup and memory reduction by SLoPe during pretraining and inference across LLMs with different model parameter sizes. To demonstrate the scalability and efficiency of our method, we conducted extensive benchmarking on OPT (2.6B to 66B), LLaMA-3-8B, and Mistral-v0.3-7B models. ... To assess the impact of SLoPe on model accuracy, we conducted pretraining experiments across various models and datasets (details in Appendix O).
Researcher Affiliation | Collaboration | Mohammad Mozaffari (Department of Computer Science, University of Toronto); Amir Yazdanbakhsh (Google DeepMind, Mountain View, USA); Zhao Zhang (Department of Electrical and Computer Engineering, Rutgers University); Maryam Mehri Dehnavi (Department of Computer Science, University of Toronto)
Pseudocode | Yes | Algorithm 1: Accelerated Sparse Pretraining Algorithm for a Linear Layer
Open Source Code | Yes | Code and data for SLoPe is available at: https://bit.ly/slope-llm
Open Datasets | Yes | We pretrained both the small (117M parameters) and large (774M parameters) variants of GPT-2 (46) on the OpenWebText dataset (1). For a fair comparison, we evaluate the models on MMLU (23), ARC Challenge (6), and OpenBookQA (35) zero-shot tasks implemented in Language Model Evaluation Harness (18). ... We evaluate the performance of BERT-Large-Uncased on the SQuAD v1.1 (48) and GLUE (57) tasks.
Dataset Splits | No | The paper names the datasets used (e.g., OpenWebText, MMLU, SQuAD, GLUE) and the evaluation metrics, but does not provide training/validation/test split percentages, sample counts, or citations to predefined splits for these datasets. For instance, it reports validation perplexity without detailing the split used for OpenWebText.
Hardware Specification | Yes | Our experiments were conducted on the Narval and Mist clusters at Compute Canada (7) and the Lonestar 6 cluster at the Texas Advanced Computing Center (54). Each Narval node is equipped with four Nvidia A100 GPUs, each with 40GB of memory. Mist nodes feature four Nvidia V100 GPUs, each with 32GB of memory, while Lonestar 6 nodes have three Nvidia A100 GPUs, each with 40GB of memory.
Software Dependencies | No | The paper mentions several software components, including cuSPARSELt, PyTorch, the cuBLAS backend, and FlashAttention-2, but does not give version numbers for these key dependencies in its experimental setup description, which reproducibility requires.
Experiment Setup | Yes | BERT-Large-Uncased pretraining consists of two phases... Phase 1 comprises 7,038 iterations with a global batch size of 65,536 and a sequence length of 128. Phase 2 includes 1,563 iterations with a global batch size of 32,768 and a sequence length of 512. ... To understand the impact of low-rank adapters on pretraining performance, we conducted ablations using low-rank adapter ranks of 4, 16, and 64 for 1% of the total number of iterations. ... For Extended SR-STE, we have used a decay factor of 6e-6, since it resulted in the lowest perplexity in OpenWebText.
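For readers unfamiliar with the "sparse plus low-rank adapter" pattern named in the title and in the Pseudocode row (Algorithm 1), the following minimal NumPy sketch illustrates the generic idea: a weight matrix pruned to 2:4 structured sparsity (the pattern cuSPARSELt accelerates), combined with a low-rank adapter in the forward pass. All function names, shapes, and initializations here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def prune_2_4(w):
    # Illustrative 2:4 structured pruning: in each contiguous group of 4
    # weights along the input dimension, keep the 2 largest magnitudes
    # and zero the 2 smallest. (Not the paper's double-pruning scheme.)
    out = w.copy()
    groups = out.reshape(-1, 4)                  # view: groups of 4 weights
    order = np.argsort(np.abs(groups), axis=1)   # ascending by magnitude
    rows = np.arange(groups.shape[0])[:, None]
    groups[rows, order[:, :2]] = 0.0             # zero the 2 smallest per group
    return out

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 4, 2                      # toy dimensions (assumed)
W = rng.standard_normal((d_out, d_in))
Ws = prune_2_4(W)                                # 2:4-sparse weight
A = 0.01 * rng.standard_normal((rank, d_in))     # adapter down-projection
B = np.zeros((d_out, rank))                      # adapter up-projection, zero-init
x = rng.standard_normal(d_in)

# Forward pass: sparse matmul plus low-rank correction.
y = Ws @ x + B @ (A @ x)
```

Exactly half of the entries in `Ws` are zero by construction, which is what lets a 2:4 sparse kernel (e.g., via cuSPARSELt) replace the dense matmul; the rank-`rank` adapter adds only `(d_in + d_out) * rank` extra parameters.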