Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Shape it Up! Restoring LLM Safety during Finetuning

Authors: ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate DSS in a realistic finetuning-as-a-service setting, where a provider starts from an aligned model and aims to ensure that the LLM finetuned on user data maintains the original model’s safety. This goal reflects the deployment scenario described in Sec. 2 and motivates our evaluation design. In Sec. 6.1, we describe the evaluation setup. In Sec. 6.2, we assess DSS across representative finetuning risk scenarios. Sec. 6.3 evaluates its generalization across LLMs, guardrails, harm levels, and datasets, and Sec. 6.4 examines its robustness to broader risks a service provider may encounter.
Researcher Affiliation	Collaboration	ShengYun Peng¹, Pin-Yu Chen², Jianfeng Chi³, Seongmin Lee¹, Duen Horng Chau¹. ¹Georgia Tech, ²IBM Research, ³Meta.
Pseudocode	No	The paper describes the methodology using mathematical equations and textual explanations, particularly in Section 5.1 "DSS Loss Function Design", but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is publicly available at https://github.com/poloclub/star-dss.
Open Datasets	Yes	Datasets & Metrics. We evaluate safety on HEx-PHI [9] and AdvBench [35], and capability on MMLU [36] and ARC-Challenge (ARC-C) [37]. Safety is measured as the percentage of responses judged safe by GPT-4o, and capability is measured by accuracy. For harmful finetuning, we use PureBad [9], BeaverTails [50], and Anthropic HH-RLHF [51]. GSM8K [43] is used for capability finetuning with 8-shot evaluation [27, 52]. SafeInstruct [14] provides the safe training samples.
Dataset Splits	Yes	We evaluate safety on HEx-PHI [9] and AdvBench [35], and capability on MMLU [36] and ARC-Challenge (ARC-C) [37]. Safety is measured as the percentage of responses judged safe by GPT-4o, and capability is measured by accuracy.
Hardware Specification	Yes	Experiments were conducted on a single node with 8 A40 GPUs.
Software Dependencies	No	The paper mentions software components implicitly through the description of the methodology (e.g., use of the AdamW optimizer), but does not provide specific version numbers for these dependencies.
Experiment Setup	Yes	Table 6: Hyperparameters used for all finetuning experiments unless otherwise specified. Optimizer: AdamW; Adam betas: (0.9, 0.95); learning rate: 5e-6; weight decay: 0; batch size (per device): 4; gradient accumulation steps: 1; max sequence length: 2048; learning rate scheduler: cosine with warmup; warmup ratio: 3%; number of epochs: 10; KL loss scaling (λ): 0.5; chunk length (M) for STAR: 5.
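
For readers who want to reuse the Table 6 values quoted in the Experiment Setup row, the sketch below collects them in a small Python config object. It is an illustration only, not the authors' code: the class name `FinetuneConfig`, the field names, and both helper functions are hypothetical. The paper says only "cosine with warmup", so the linear warmup and the decay to zero are assumptions, as is the idea that all 8 A40 GPUs are used for data-parallel training.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Values copied from Table 6; the field names are hypothetical."""
    optimizer: str = "AdamW"
    adam_betas: tuple = (0.9, 0.95)
    learning_rate: float = 5e-6
    weight_decay: float = 0.0
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 1
    max_sequence_length: int = 2048
    lr_scheduler: str = "cosine_with_warmup"
    warmup_ratio: float = 0.03
    num_epochs: int = 10
    kl_loss_scaling: float = 0.5   # lambda in Table 6
    star_chunk_length: int = 5     # M in Table 6


def effective_batch_size(cfg: FinetuneConfig, num_devices: int) -> int:
    """Global batch size, assuming plain data parallelism across devices."""
    return (cfg.per_device_batch_size
            * cfg.gradient_accumulation_steps
            * num_devices)


def lr_at_step(cfg: FinetuneConfig, step: int, total_steps: int) -> float:
    """Assumed schedule: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, math.ceil(cfg.warmup_ratio * total_steps))
    if step < warmup_steps:
        return cfg.learning_rate * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cfg.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = FinetuneConfig()
    # 4 per device * 1 accumulation step * 8 devices (assumed) = 32
    print(effective_batch_size(cfg, num_devices=8))
    print(lr_at_step(cfg, step=0, total_steps=1000))
```

If fewer GPUs are used for training, or the authors' scheduler decays to a nonzero floor, the derived values will differ; only the field defaults come from the paper.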