Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
COSEE: Consistency-Oriented Signal-Based Early Exiting via Calibrated Sample Weighting Mechanism
Authors: Jianing He, Qi Zhang, Hongyun Zhang, Xuanjing Huang, Usman Naseem, Duoqian Miao
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the GLUE benchmark demonstrate the effectiveness of our COSEE across multiple exiting signals and backbones, yielding a better trade-off between performance and efficiency. |
| Researcher Affiliation | Academia | Jianing He1, Qi Zhang1, Hongyun Zhang1, Xuanjing Huang2, Usman Naseem3, Duoqian Miao1* 1Tongji University, China 2Fudan University, China 3Macquarie University, Australia EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology in prose, without explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/He-Jianing/COSEE |
| Open Datasets | Yes | we evaluate COSEE on six classification tasks from the GLUE benchmark (Wang et al. 2019), including SST-2, MRPC, QNLI, RTE, QQP, and MNLI. |
| Dataset Splits | Yes | Data statistics are shown in Table 1 (Dataset: classes, train size, test size, task): SST-2: 2, 67k, 1.8k, Sentiment; MRPC: 2, 3.7k, 1.7k, Paraphrase; QQP: 2, 364k, 391k, Paraphrase; MNLI: 3, 393k, 20k, NLI; QNLI: 2, 105k, 5.4k, QA/NLI; RTE: 2, 2.5k, 3k, NLI. |
| Hardware Specification | Yes | We conduct experiments on two RTX4090 GPUs with 24GB. |
| Software Dependencies | No | Our implementation is based on Hugging Face's Transformers (Wolf et al. 2020). - This mentions a software package but lacks specific version numbers for its dependencies or the package itself. |
| Experiment Setup | Yes | We perform a grid search over learning rates of {1e-5, 2e-5, 3e-5, 5e-5}, batch sizes of {16, 32, 128}, α values in Eq.(9) of {0.001, 0.01, 0.1, 1.0}, and β0 values in Eq.(4) of {0.05, 0.2, 1.0, 10.0}. We set ϵ to 0.3 in Eq.(7) and K to 5 in Eq.(5). The maximum sequence length is fixed at 128. We employ a linear decay learning rate scheduler and the AdamW optimizer (Loshchilov and Hutter 2019). |
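The hyperparameter search reported in the Experiment Setup row can be sketched as a Cartesian-product grid. This is an illustrative reconstruction only: the variable names (`alpha` for Eq. 9, `beta0` for Eq. 4) and the `grid_configs` helper are assumptions, not taken from the authors' released code.

```python
from itertools import product

# Searched hyperparameters, exactly as listed in the paper's setup.
# `alpha` and `beta0` name the symbols from Eq. (9) and Eq. (4);
# the dict/loop structure is illustrative, not the authors' code.
grid = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 128],
    "alpha": [0.001, 0.01, 0.1, 1.0],   # Eq. (9)
    "beta0": [0.05, 0.2, 1.0, 10.0],    # Eq. (4)
}

# Fixed (non-searched) settings from the same row.
fixed = {"epsilon": 0.3, "K": 5, "max_seq_length": 128}

def grid_configs(grid):
    """Yield one config dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        config.update(fixed)
        yield config

configs = list(grid_configs(grid))
print(len(configs))  # 4 * 3 * 4 * 4 = 192 candidate configurations
```

Each yielded dict would be passed to one training run; the paper does not state how the best configuration per task is selected, so that step is omitted here.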