Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy. We evaluate CoMe on seven LLMs, three sparsity levels (10%, 20%, 30%), seven NLP benchmarks, and two datasets, comparing against nine competitive baselines. Experimental results show that CoMe consistently outperforms existing methods.
Researcher Affiliation	Collaboration	Fei Wang1,2, Li Shen3,6, Liang Ding4, Chao Xue2, Ye Liu1, Changxing Ding1,5, 1South China University of Technology 2JD Explore Academy 3Shenzhen Campus of Sun Yat-sen University 4University of Sydney 5Pazhou Lab 6Center for AI Theoretical Foundation and Systems, Shenzhen Loop Area Institute
Pseudocode	Yes	Algorithm 1 Progressive Concatenation-based Layer Merging Strategy (CoMe) ... Algorithm 2 Progressive Posterior-based CoMe (CoMe-P) ... Algorithm 3 CoMe Single-Process Post-training (CoMe-sp) ... Algorithm 4 CoMe Multi-Process Post-training (CoMe-mp)
Open Source Code	Yes	Our code is available at https://github.com/MPI-Lab/CoMe. Our project code can be found at https://github.com/MPI-Lab/CoMe.
Open Datasets	Yes	Model performance is assessed using the lm-evaluation-harness [8] framework on seven standard benchmarks commonly adopted in model compression research: ARC-challenge (ARC-c), ARC-easy (ARC-e) [6], HellaSwag (HellaS) [43], OpenBookQA (OBQA) [26], PIQA [4], Winoground (WinoG) [36] under the zero-shot setting, and MMLU [11] under the five-shot setting. Perplexity (PPL) is measured on the C4 [28] and Wikitext-2 (Wiki-2) [25] datasets.
Dataset Splits	Yes	Model performance is assessed using the lm-evaluation-harness [8] framework on seven standard benchmarks commonly adopted in model compression research: ARC-challenge (ARC-c), ARC-easy (ARC-e) [6], HellaSwag (HellaS) [43], OpenBookQA (OBQA) [26], PIQA [4], Winoground (WinoG) [36] under the zero-shot setting, and MMLU [11] under the five-shot setting. Accuracy is reported with normalized option lengths to ensure comparability. We conducted pruning experiments on LLaMA-2-7b... using the Wiki-2 calibration set (256 samples) with a default sparsity of 30%. ... Men et al. [24] use the PG19 long-document dataset for calibration, and we control the size of the calibration dataset to 256 samples. ... When selecting MMLU data, MKA randomly samples five samples from 50 sub-tasks. In our implementation, we uniformly sample 250 samples from each sub-task. The LLM-Streamline trains a merged layer using 30,000 samples and employs five epochs, following the settings of Chen et al. [5]. The post-training process for the FuseGPT method is synchronized with the pruning process, utilizing 1,024 samples from the Wiki-2 dataset.
Hardware Specification	Yes	All experiments are conducted using an A100-40G GPU.
Software Dependencies	No	The paper mentions software tools like "lm-evaluation-harness [8] framework" and "AdamW optimizer" but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch, CUDA versions) which are required for a reproducible description of ancillary software.
Experiment Setup	Yes	All methods are evaluated under sparsity levels of 10%, 20%, and 30%. ... CoMe is conducted on LLaMA-2-7b using the Wiki-2 calibration set (256 samples) with a default sparsity of 30%. During the layer pruning process, we set the number of layers merged per iteration to 2 (i.e., m = 1). ... For the Qwen2.5-7b model, p is set to 32. ... The values of ρ for the Mistral-7b, Qwen2.5-7b, and LLaMA-3-8b models are set to 0.97, 0.85, and 0.97, respectively. ... For optimization, we utilize the AdamW optimizer with a weight decay coefficient of 1e-2 and implement cosine decay for learning rate scheduling. The CoMe-sp employs a fixed learning rate of 1e-5. The CoMe-mp adopts layer-specific decaying rates during multi-layer distillation, with learning rates progressively decreasing from the shallow to the deep layers as follows: 5e-4, 2.5e-4, 1e-4, 7.5e-5, 5e-5, 2.5e-5, and 1e-5 for LLaMA-2-7b; 5e-4, 2.5e-4, 5e-5, 2.5e-5, 1e-5, and 7.5e-6 for Qwen-3-4b. Table 7 summarizes the post-training settings for all methods (e.g., # Iterations, # Epochs, # Steps, Batch size, Token length).
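
The layer-specific learning rates and cosine decay quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' implementation. It only shows how per-layer base rates might be combined with a cosine schedule. The function names, the absence of warmup, the decay floor of zero, and the `total_steps` value are all assumptions made for illustration; only the learning rates and the weight decay coefficient come from the row above.

```python
import math

# Per-layer base learning rates quoted for CoMe-mp on LLaMA-2-7b,
# ordered from the shallowest to the deepest merged layer.
LLAMA2_7B_LAYER_LRS = (5e-4, 2.5e-4, 1e-4, 7.5e-5, 5e-5, 2.5e-5, 1e-5)
WEIGHT_DECAY = 1e-2  # AdamW weight decay coefficient quoted in the table


def cosine_decay(base_lr, step, total_steps, min_lr=0.0):
    """Cosine-decayed learning rate at `step`.

    Assumption: no warmup, and the rate decays from base_lr to min_lr.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def layer_lrs_at(step, total_steps, base_lrs=LLAMA2_7B_LAYER_LRS):
    """Learning rate of every merged layer at a given training step."""
    return [cosine_decay(lr, step, total_steps) for lr in base_lrs]


def param_group_configs(base_lrs=LLAMA2_7B_LAYER_LRS):
    """One optimizer-style config per merged layer (hypothetical layout)."""
    return [
        {"layer_index": i, "lr": lr, "weight_decay": WEIGHT_DECAY}
        for i, lr in enumerate(base_lrs)
    ]


if __name__ == "__main__":
    total_steps = 1000  # hypothetical; the actual step counts are in the paper's Table 7
    for step in (0, 500, 1000):
        print(step, [f"{lr:.2e}" for lr in layer_lrs_at(step, total_steps)])
```

In a PyTorch setup, dictionaries like those returned by `param_group_configs` would typically gain a `params` entry and be passed to the optimizer as parameter groups, so that each merged layer keeps its own rate while sharing one schedule shape.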