Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

COSEE: Consistency-Oriented Signal-Based Early Exiting via Calibrated Sample Weighting Mechanism

Authors: Jianing He, Qi Zhang, Hongyun Zhang, Xuanjing Huang, Usman Naseem, Duoqian Miao

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on the GLUE benchmark demonstrate the effectiveness of our COSEE across multiple exiting signals and backbones, yielding a better trade-off between performance and efficiency.
Researcher Affiliation Academia Jianing He¹, Qi Zhang¹, Hongyun Zhang¹, Xuanjing Huang², Usman Naseem³, Duoqian Miao¹* — ¹Tongji University, China; ²Fudan University, China; ³Macquarie University, Australia. EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the methodology in prose, without explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/He-Jianing/COSEE
Open Datasets Yes we evaluate COSEE on six classification tasks from the GLUE benchmark (Wang et al. 2019), including SST-2, MRPC, QNLI, RTE, QQP, and MNLI.
Dataset Splits Yes Data statistics are shown in Table 1. Table 1: Dataset Statistics.
Dataset  Classes  |Train|  |Test|  Task
SST-2    2        67k      1.8k    Sentiment
MRPC     2        3.7k     1.7k    Paraphrase
QQP      2        364k     391k    Paraphrase
MNLI     3        393k     20k     NLI
QNLI     2        105k     5.4k    QA/NLI
RTE      2        2.5k     3k      NLI
Hardware Specification Yes We conduct experiments on two RTX4090 GPUs with 24GB.
Software Dependencies No Our implementation is based on Hugging Face's Transformers (Wolf et al. 2020). - This mentions a software package but lacks specific version numbers for its dependencies or the package itself.
Experiment Setup Yes We perform a grid search over learning rates of {1e-5, 2e-5, 3e-5, 5e-5}, batch sizes of {16, 32, 128}, α values in Eq.(9) of {0.001, 0.01, 0.1, 1.0}, and β0 values in Eq.(4) of {0.05, 0.2, 1.0, 10.0}. We set ϵ to 0.3 in Eq.(7) and K to 5 in Eq.(5). The maximum sequence length is fixed at 128. We employ a linear decay learning rate scheduler and the AdamW optimizer (Loshchilov and Hutter 2019).
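The grid search described above can be sketched as follows. This is a minimal illustration of enumerating the reported hyperparameter grid, not the authors' actual training code; the `train_and_evaluate` call is a hypothetical placeholder for the COSEE training loop.

```python
from itertools import product

# Hyperparameter grids as reported in the paper's experiment setup.
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32, 128]
alphas = [0.001, 0.01, 0.1, 1.0]   # alpha in Eq.(9)
beta0s = [0.05, 0.2, 1.0, 10.0]    # beta_0 in Eq.(4)

# Fixed settings: epsilon = 0.3 (Eq.(7)), K = 5 (Eq.(5)),
# max sequence length 128, linear decay scheduler, AdamW optimizer.
FIXED = {"epsilon": 0.3, "K": 5, "max_seq_len": 128}

grid = list(product(learning_rates, batch_sizes, alphas, beta0s))
print(len(grid))  # 4 * 3 * 4 * 4 = 192 configurations

for lr, bs, alpha, beta0 in grid:
    config = {"lr": lr, "batch_size": bs, "alpha": alpha,
              "beta0": beta0, **FIXED}
    # train_and_evaluate(config)  # placeholder for the actual run
```

Each configuration would be trained and scored on the GLUE development data, with the best setting selected per task.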