Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
COSEE: Consistency-Oriented Signal-Based Early Exiting via Calibrated Sample Weighting Mechanism
Authors: Jianing He, Qi Zhang, Hongyun Zhang, Xuanjing Huang, Usman Naseem, Duoqian Miao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the GLUE benchmark demonstrate the effectiveness of our COSEE across multiple exiting signals and backbones, yielding a better trade-off between performance and efficiency. |
| Researcher Affiliation | Academia | Jianing He¹, Qi Zhang¹, Hongyun Zhang¹, Xuanjing Huang², Usman Naseem³, Duoqian Miao¹*; ¹Tongji University, China; ²Fudan University, China; ³Macquarie University, Australia |
| Pseudocode | No | The paper describes the methodology in prose, without explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/He-Jianing/COSEE |
| Open Datasets | Yes | we evaluate COSEE on six classification tasks from the GLUE benchmark (Wang et al. 2019), including SST-2, MRPC, QNLI, RTE, QQP, and MNLI. |
| Dataset Splits | Yes | Data statistics are shown in Table 1. Table 1: Dataset Statistics (dataset: classes, train size, test size, task). SST-2: 2, 67k, 1.8k, Sentiment; MRPC: 2, 3.7k, 1.7k, Paraphrase; QQP: 2, 364k, 391k, Paraphrase; MNLI: 3, 393k, 20k, NLI; QNLI: 2, 105k, 5.4k, QA/NLI; RTE: 2, 2.5k, 3k, NLI. |
| Hardware Specification | Yes | We conduct experiments on two RTX4090 GPUs with 24GB. |
| Software Dependencies | No | Our implementation is based on Hugging Face's Transformers (Wolf et al. 2020). - This mentions a software package but lacks specific version numbers for the package itself or its dependencies. |
| Experiment Setup | Yes | We perform a grid search over learning rates of {1e-5, 2e-5, 3e-5, 5e-5}, batch sizes of {16, 32, 128}, α values in Eq.(9) of {0.001, 0.01, 0.1, 1.0}, and β0 values in Eq.(4) of {0.05, 0.2, 1.0, 10.0}. We set ϵ to 0.3 in Eq.(7) and K to 5 in Eq.(5). The maximum sequence length is fixed at 128. We employ a linear decay learning rate scheduler and the AdamW optimizer (Loshchilov and Hutter 2019). |
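
The search space quoted in the Experiment Setup row can be enumerated with a short sketch like the one below. It only illustrates the size and shape of the grid; it is not taken from the authors' repository, and the dictionary keys (`learning_rate`, `batch_size`, `alpha`, `beta0`, `epsilon`, `K`, `max_seq_length`) and the helper `iter_configs` are hypothetical names chosen for illustration. Whether the paper searched the full Cartesian product per task is an assumption.

```python
from itertools import product

# Search space as quoted from the paper's experiment setup.
# Key names are hypothetical; only the values come from the source.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 128],
    "alpha": [0.001, 0.01, 0.1, 1.0],  # alpha in Eq.(9)
    "beta0": [0.05, 0.2, 1.0, 10.0],   # beta0 in Eq.(4)
}

# Values the paper reports as fixed rather than searched.
FIXED = {"epsilon": 0.3, "K": 5, "max_seq_length": 128}


def iter_configs(space=SEARCH_SPACE, fixed=FIXED):
    """Yield one config dict per point of the full Cartesian grid (assumed)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        config.update(fixed)
        yield config


configs = list(iter_configs())
print(len(configs))  # 4 * 3 * 4 * 4 = 192 candidate configurations
```

Under the full-grid assumption this gives 192 candidate configurations per task, which helps when estimating the compute needed to reproduce the tuning on the reported two-GPU setup.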