Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Not All Layers of LLMs Are Necessary During Inference

Authors: Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on well-known LLMs such as the Llama 2 series and OPT show that AdaInfer achieves an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). The paper also includes a dedicated section titled '5 Experiments' detailing experimental settings, main results, and analysis.
Researcher Affiliation | Academia | 1 University of Electronic Science and Technology of China, Chengdu, China; 2 Beijing Academy of Artificial Intelligence, Beijing, China; 3 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; 4 School of Computer Science and Engineering, Nanyang Technological University, Singapore
Pseudocode | No | The paper describes the AdaInfer algorithm in Section 4 and illustrates its workflow in Figure 2a, but does not present it in a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor a link to a code repository for the AdaInfer method.
Open Datasets | Yes | Question Answering Tasks: (1) MMLU [Hendrycks et al., 2021]... (2) CommonsenseQA [Talmor et al., ]... (3) SQuAD [Rajpurkar et al., 2016]... Text Classification Tasks: (1) SST-2 [Socher et al., 2013]... (2) AG News [Zhang et al., 2015]...
Dataset Splits | No | The paper refers to using the 'test set' and 'training set examples' for evaluation and in-context learning, and mentions 'sample sizes of 5, 10, 15, and 20' for few-shot scenarios, but it provides no dataset split percentages, explicit sample counts for train/validation/test sets, or partitioning methodology.
Hardware Specification | Yes | Table 3 compares the runtime of AdaInfer with a dense implementation on MMLU and sentiment tasks (5-shot, batch size 1), using 6 V100 GPUs (32 GB).
Software Dependencies | No | The paper states, 'We utilized the sklearn library for training SVM and CRF, adhering to their default configurations,' but does not provide version numbers for these libraries.
Experiment Setup | Yes | In-Context Learning Setting: We evaluate AdaInfer under zero-shot and few-shot scenarios, using sample sizes of 5, 10, 15, and 20. ... For in-context learning prompts, we use a default template, Q : {xk}\n A : {yk}\n\n, concatenating random xk and yk samples from task-specific training sets. ... We utilized the sklearn library for training SVM and CRF, adhering to their default configurations. ... Batch size is set to 1, using 6 V100 GPUs (32 GB).
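
The prompt construction quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration of the reported default template (Q : {xk}\n A : {yk}\n\n with demonstrations drawn at random from a task's training set); the `build_prompt` helper, the fixed seed, and the toy sentiment examples are assumptions for illustration, not artifacts from the paper.

```python
import random

def build_prompt(train_examples, query, k=5, seed=0):
    """Assemble a k-shot in-context learning prompt using the template
    'Q : {xk}\\n A : {yk}\\n\\n', then append the unanswered query."""
    rng = random.Random(seed)  # seeded for repeatability (an assumption)
    shots = rng.sample(train_examples, k)  # random (x, y) demonstrations
    demos = "".join(f"Q : {x}\n A : {y}\n\n" for x, y in shots)
    return demos + f"Q : {query}\n A :"

# Toy sentiment-style training pairs (illustrative only)
train = [("great movie!", "positive"), ("dull plot", "negative"),
         ("loved every minute", "positive"), ("waste of time", "negative")]
prompt = build_prompt(train, "a charming film", k=2)
print(prompt)
```

The model's next-token continuation after the trailing `A :` is then read off as the prediction; in the zero-shot setting, `k=0` would leave only the final query line.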