Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

COSEE: Consistency-Oriented Signal-Based Early Exiting via Calibrated Sample Weighting Mechanism

Authors: Jianing He, Qi Zhang, Hongyun Zhang, Xuanjing Huang, Usman Naseem, Duoqian Miao

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on the GLUE benchmark demonstrate the effectiveness of our COSEE across multiple exiting signals and backbones, yielding a better trade-off between performance and efficiency.
Researcher Affiliation Academia Jianing He¹, Qi Zhang¹, Hongyun Zhang¹, Xuanjing Huang², Usman Naseem³, Duoqian Miao¹* — ¹Tongji University, China; ²Fudan University, China; ³Macquarie University, Australia. EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the methodology in prose, without explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/He-Jianing/COSEE
Open Datasets Yes we evaluate COSEE on six classification tasks from the GLUE benchmark (Wang et al. 2019), including SST-2, MRPC, QNLI, RTE, QQP, and MNLI.
Dataset Splits Yes Data statistics are shown in Table 1. Table 1: Dataset Statistics.
Dataset  Classes  |Train|  |Test|  Task
SST-2    2        67k      1.8k    Sentiment
MRPC     2        3.7k     1.7k    Paraphrase
QQP      2        364k     391k    Paraphrase
MNLI     3        393k     20k     NLI
QNLI     2        105k     5.4k    QA/NLI
RTE      2        2.5k     3k      NLI
Hardware Specification Yes We conduct experiments on two RTX4090 GPUs with 24GB.
Software Dependencies No Our implementation is based on Hugging Face's Transformers (Wolf et al. 2020). - This mentions a software package but lacks specific version numbers for its dependencies or the package itself.
Experiment Setup Yes We perform a grid search over learning rates of {1e-5, 2e-5, 3e-5, 5e-5}, batch sizes of {16, 32, 128}, α values in Eq.(9) of {0.001, 0.01, 0.1, 1.0}, and β0 values in Eq.(4) of {0.05, 0.2, 1.0, 10.0}. We set ϵ to 0.3 in Eq.(7) and K to 5 in Eq.(5). The maximum sequence length is fixed at 128. We employ a linear decay learning rate scheduler and the AdamW optimizer (Loshchilov and Hutter 2019).
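The grid search described above can be sketched as follows. This is a minimal illustration of enumerating the reported hyperparameter grid, not the authors' actual training code; the `train_and_evaluate` call is a hypothetical placeholder for the COSEE training loop.

```python
from itertools import product

# Hyperparameter grids as reported in the paper's experiment setup.
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32, 128]
alphas = [0.001, 0.01, 0.1, 1.0]   # alpha in Eq.(9)
beta0s = [0.05, 0.2, 1.0, 10.0]    # beta_0 in Eq.(4)

# Fixed settings: epsilon = 0.3 (Eq.(7)), K = 5 (Eq.(5)),
# max sequence length 128, linear decay scheduler, AdamW optimizer.
FIXED = {"epsilon": 0.3, "K": 5, "max_seq_len": 128}

grid = list(product(learning_rates, batch_sizes, alphas, beta0s))
print(len(grid))  # 4 * 3 * 4 * 4 = 192 configurations

for lr, bs, alpha, beta0 in grid:
    config = {"lr": lr, "batch_size": bs, "alpha": alpha,
              "beta0": beta0, **FIXED}
    # train_and_evaluate(config)  # placeholder for the actual run
```

Each configuration would be trained and scored on the GLUE development data, with the best setting selected per task.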