Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks without any training or healing steps, resulting in minimal computational overhead. Section 3 then provides comprehensive experimental results and ablation studies, demonstrating the effectiveness and robustness of our method, and analyzing the key factors that influence its performance.
Researcher Affiliation	Collaboration	Dmitriy Shopkhoev (MWS AI, ITMO University); Ammar Ali (MWS AI, ITMO University); Magauiya Zhussip (MWS AI); Valentin Malykh (MWS AI, ITMO University, IITU University); Stamatios Lefkimmiatis (MWS AI); Nikos Komodakis (University of Crete, IACM-Forth, Archimedes Athena RC); Sergey Zagoruyko (Polynome)
Pseudocode	No	The paper describes methods and equations for the proposed approach (e.g., Section 2.1 Layers selection, Section 2.2 Linear Transform Estimation), but it does not present these in structured pseudocode or algorithm blocks. Instead, the steps are explained in narrative text and mathematical formulas.
Open Source Code	Yes	We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.
Open Datasets	Yes	In Table 1 we present results on different benchmarks that have been widely used in previous research. These benchmarks have been introduced in the following works: CMNLI [56], HellaSwag [57], PIQA [2], CHID [59], WSC [24], MMLU [16], CMMLU [25], Race-High/Middle [22], C3 [46]. Additionally, we benchmarked ReplaceMe using well-established public datasets, namely Winogrande [41], BoolQ [4], OpenBookQA [31], SciQ [55], and Lambada OpenAI [35]. Our primary experiments utilized datasets such as Arcee [5], FineWeb [36], and SlimOrca [26], consistent with prior work like UIDL [13].
Dataset Splits	Yes	In this test, we used the training part of the SciQ dataset as a calibration data and then measured performance on its test set. For comparison, we also applied our method using part of the general SlimOrca dataset but still evaluated it on SciQ. Using the Llama3 8B model, we found that calibration data from the same task leads to much better accuracy than calibration with general-purpose data. This shows that tailoring calibration data to the target task can significantly improve compressed model performance. See the results in Table 17.
Hardware Specification	Yes	All experiments were conducted using an NVIDIA A100-SXM4-40GB GPU with an AMD EPYC 7742 64-Core Processor, running Ubuntu 22.04 and Python 3.10. The software environment was based on the official NVIDIA PyTorch container nvcr.io/nvidia/pytorch:23.10-py3. For additional testing and validation, models were also tested on a P100 GPU using the Kaggle environment, which imposes stricter compute and memory constraints.
Software Dependencies	Yes	All experiments were conducted using an NVIDIA A100-SXM4-40GB GPU with an AMD EPYC 7742 64-Core Processor, running Ubuntu 22.04 and Python 3.10. The software environment was based on the official NVIDIA PyTorch container nvcr.io/nvidia/pytorch:23.10-py3.
Experiment Setup	Yes	For numerical estimation of the linear transform, we used Adam optimizer with LR 1e-4 and batch size 1024, iterating for 10 epochs over the calibration data. In Table 14 we present hyperparameters and configuration for ReplaceMe (Cosine-based Training): Optimizer Adam, Learning Rate 0.0001, Batch Size 1024, Epochs 10, Loss Function Cosine Distance, Loss Weight Initialization Identity Matrix, Bias False. Table 15 details hyperparameters for Healing Experiments (full_transform vs. full_model).
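
To make the quoted Experiment Setup concrete, here is a minimal pure-Python sketch of fitting a bias-free linear transform, initialized to the identity, by minimizing cosine distance with Adam. It is not the authors' implementation, which presumably runs on PyTorch tensors of model activations. The class and function names (CosineTrainingConfig, fit, toy_pairs), the Adam beta and epsilon defaults, and the 2-D rotation toy data are all assumptions made for illustration. Only the learning rate, batch size, epoch count, loss, initialization, and bias setting come from the row above.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CosineTrainingConfig:
    # Values quoted in the Experiment Setup row; the field names are hypothetical.
    learning_rate: float = 1e-4
    batch_size: int = 1024
    epochs: int = 10
    bias: bool = False  # no bias term, so the transform is a plain matrix


def identity(d):
    # Identity-matrix initialization of the d x d transform.
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def apply(x, W):
    # Row vector times matrix: y[j] = sum_i x[i] * W[i][j].
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def cosine_distance(y, t):
    return 1.0 - sum(a * b for a, b in zip(y, t)) / (norm(y) * norm(t))


def grad(x, W, t):
    # Analytic gradient of cosine_distance(apply(x, W), t) with respect to W.
    y = apply(x, W)
    ny, nt = norm(y), norm(t)
    dot = sum(a * b for a, b in zip(y, t))
    dy = [dot * y[j] / (ny ** 3 * nt) - t[j] / (ny * nt) for j in range(len(y))]
    return [[x[i] * dy[j] for j in range(len(y))] for i in range(len(x))]


def mean_loss(pairs, W):
    return sum(cosine_distance(apply(x, W), t) for x, t in pairs) / len(pairs)


def fit(pairs, cfg, beta1=0.9, beta2=0.999, eps=1e-8):
    # Mini-batch Adam on the mean cosine distance (betas and eps are assumed defaults).
    d = len(pairs[0][0])
    W = identity(d)
    m = [[0.0] * d for _ in range(d)]
    v = [[0.0] * d for _ in range(d)]
    step = 0
    for _ in range(cfg.epochs):
        for start in range(0, len(pairs), cfg.batch_size):
            batch = pairs[start:start + cfg.batch_size]
            g = [[0.0] * d for _ in range(d)]
            for x, t in batch:
                gi = grad(x, W, t)
                for i in range(d):
                    for j in range(d):
                        g[i][j] += gi[i][j] / len(batch)
            step += 1
            for i in range(d):
                for j in range(d):
                    m[i][j] = beta1 * m[i][j] + (1 - beta1) * g[i][j]
                    v[i][j] = beta2 * v[i][j] + (1 - beta2) * g[i][j] ** 2
                    m_hat = m[i][j] / (1 - beta1 ** step)
                    v_hat = v[i][j] / (1 - beta2 ** step)
                    W[i][j] -= cfg.learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return W


def toy_pairs(n=64, angle=0.3, seed=0):
    # Hypothetical calibration data: unit vectors and their rotated copies.
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = rng.uniform(0.0, 2.0 * math.pi)
        x = [math.cos(a), math.sin(a)]
        t = [math.cos(a + angle), math.sin(a + angle)]
        pairs.append((x, t))
    return pairs
```

As a usage example, calling fit(toy_pairs(), CosineTrainingConfig()) returns a 2 x 2 matrix that can be compared against identity(2) with mean_loss. With the quoted learning rate and only 10 epochs on 64 toy pairs, the transform moves only slightly away from the identity, so the point of the sketch is the shape of the training loop rather than convergence.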