Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Stackelberg Self-Annotation: A Robust Approach to Data-Efficient LLM Alignment

Authors: Chu Xu, Zhixin Zhang, Tianyu Jia, Yujie Jin

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we present an extensive empirical evaluation of our proposed Stackelberg Self-Annotated Preference Optimization (SSAPO) algorithm. We introduce the basic experiment setup in this subsection (Cf. Appendix G for more details). The settings are mostly consistent to the recent literature Kim et al. [9]. Datasets. We used the UltraFeedback dataset [18], containing 60K samples. A seed of 2K human-labeled preferences (3.3% of total 60K data) was used for initial training. The rest (58K samples) were split into three subsets (8K, 20K, and 30K) for self-annotation in iterative stages. Models. We use the supervised fine-tuned Mistral-7B-0.1 [19] as the initial model π_init and LLaMA-3-8B for compatibility checks. All models are fine-tuned on UltraChat [20]. Evaluations. We use AlpacaEval 2.0 [14] for instruction-following tasks and MT-Bench [15] to evaluate multi-turn performance across tasks like math, coding, and writing.
Researcher Affiliation	Academia	Xu Chu (1,2,3), Zhixin Zhang (1,3), Tianyu Jia (1,3), Yujie Jin (1,3). (1) Key Laboratory of High Confidence Software Technologies, Ministry of Education; (2) Center on Frontiers of Computing Studies, Peking University; (3) School of Computer Science, Peking University. Corresponding author. Contact E-mail: EMAIL
Pseudocode	Yes	Algorithm 1 Stackelberg Self-Annotated Preference Optimization (SSAPO)
Open Source Code	Yes	https://github.com/EunTilofy/SSAPO
Open Datasets	Yes	Datasets. We used the UltraFeedback dataset [18], containing 60K samples.
Dataset Splits	Yes	A seed of 2K human-labeled preferences (3.3% of total 60K data) was used for initial training. The rest (58K samples) were split into three subsets (8K, 20K, and 30K) for self-annotation in iterative stages.
Hardware Specification	Yes	For all experiments, we utilized 4 A800 GPUs.
Software Dependencies	No	The paper mentions using Sequential Least Squares Programming (SLSQP) for optimization but does not provide a specific version number for it, nor does it list versions for other key software components or libraries like Python or PyTorch.
Experiment Setup	Yes	Implementation. We initialize training with DPO on 2K seed samples, followed by 3 iterative stages of self-annotation. In each stage, new preferences are generated via a policy that ranks response pairs. A distributionally robust optimization (DRO) is performed using sequential least squares programming (SLSQP) to adjust the model based on adversarial shifts within a Wasserstein ball. The group size G for parallel computation is set to 100 unless otherwise specified. Hyper-parameters for Different LLMs. For Mistral-7B-0.1, We set learning rate = 5 × 10^-7 and DPO hyper-parameter β = 0.1 throughout the entire preference learning process. We conduct 3 epoch for the initial DPO training and 3 iteration for SSAPO game play (leader-follower updates). For LLaMA-3-8B, We set learning rate = 1 × 10^-6 and DPO hyper-parameter β = 0.05 throughout the entire preference learning process. We conduct 1 epoch for the initial DPO training and 2 iteration for SSAPO game play (leader-follower updates).
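
The reported splits and hyper-parameters in the table above can be collected in one place, which makes the arithmetic easy to check (2K + 8K + 20K + 30K = 60K, and 2K / 60K ≈ 3.3%). The following is a minimal sketch only: the class name `PreferenceConfig`, its field names, and the dictionary keys are hypothetical and are not taken from the SSAPO repository. Only the numeric values come from the quoted text.

```python
from dataclasses import dataclass

# Sample counts as quoted in the "Dataset Splits" row.
SEED_SIZE = 2_000                       # human-labeled seed preferences
STAGE_SIZES = (8_000, 20_000, 30_000)   # self-annotation subsets, one per stage
TOTAL_SIZE = 60_000                     # UltraFeedback samples used


@dataclass(frozen=True)
class PreferenceConfig:
    """Hypothetical container; field names are illustrative assumptions."""
    model: str
    learning_rate: float
    dpo_beta: float
    initial_dpo_epochs: int
    ssapo_iterations: int
    group_size: int = 100   # G, "unless otherwise specified"
    num_gpus: int = 4       # "4 A800 GPUs"


# Values copied from the "Experiment Setup" row.
CONFIGS = {
    "mistral": PreferenceConfig("Mistral-7B-0.1", 5e-7, 0.1, 3, 3),
    "llama": PreferenceConfig("LLaMA-3-8B", 1e-6, 0.05, 1, 2),
}


def seed_fraction() -> float:
    """Share of the full dataset covered by the human-labeled seed."""
    return SEED_SIZE / TOTAL_SIZE


def splits_are_consistent() -> bool:
    """True if the seed plus the three stage subsets add up to the total."""
    return SEED_SIZE + sum(STAGE_SIZES) == TOTAL_SIZE


if __name__ == "__main__":
    print(f"seed fraction: {seed_fraction():.1%}")
    print("splits consistent:", splits_are_consistent())
    for name, cfg in CONFIGS.items():
        print(name, cfg)
```

Library versions are deliberately left out of the sketch, since the "Software Dependencies" row notes that the paper does not report them.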