Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
CARE: Decoding-Time Safety Alignment via Rollback and Introspection Intervention
Authors: Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality. |
| Researcher Affiliation | Collaboration | (1) Qwen Team, Alibaba Group; (2) The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes methods and mechanisms using numbered steps and textual explanations, but it does not include a distinct block labeled 'Pseudocode' or 'Algorithm', nor does it present procedures in a code-like structured format. |
| Open Source Code | No | Releasing our code requires additional approval from the authors' organization. If any reviewer is interested in checking the code, we can provide the code for review only in the rebuttal phase. |
| Open Datasets | Yes | We test our framework on the BeaverTails dataset [13], a benchmark specifically designed to test the safety and quality of LLM responses in diverse scenarios. |
| Dataset Splits | No | We test our framework on the BeaverTails dataset [13], a benchmark specifically designed to test the safety and quality of LLM responses in diverse scenarios. |
| Hardware Specification | Yes | Inference was performed on a single NVIDIA A100 80GB GPU. |
| Software Dependencies | No | All large language models (LLMs) used in our experiments were loaded via the Hugging Face transformers library, using the standard `AutoModelForCausalLM` interface. |
| Experiment Setup | Yes | The model is used with its default configuration settings: the repetition penalty is set to 1.05, the temperature is set to 0.7, and the top-p and top-k sampling parameters are configured to 0.8 and 20, respectively. |
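
The Software Dependencies and Experiment Setup rows together describe a loading path and a set of sampling parameters. The sketch below shows how those quoted values might be passed to the Hugging Face `generate` call. It is an illustration, not the authors' code: the `generate` helper, its `model_name` and `max_new_tokens` parameters, and the `do_sample=True` flag are assumptions that do not appear in the quoted text, and the summary does not state which model checkpoint or library version was used.

```python
# Sampling values quoted in the Experiment Setup row.
GENERATION_KWARGS = {
    "do_sample": True,           # assumption: sampling enabled so the values below take effect
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repetition_penalty": 1.05,
}


def generate(model_name: str, prompt: str, max_new_tokens: int = 256) -> str:
    """Hypothetical helper: load a causal LM and sample one completion.

    model_name and max_new_tokens are placeholders; neither is given in the summary.
    """
    # Imported lazily so the settings above can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, **GENERATION_KWARGS
    )

    # Drop the prompt tokens and decode only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because no library versions are reported (hence the "No" for Software Dependencies), a rerun with these settings may not reproduce the paper's outputs exactly.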