Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Stackelberg Self-Annotation: A Robust Approach to Data-Efficient LLM Alignment
Authors: Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present an extensive empirical evaluation of our proposed Stackelberg Self-Annotated Preference Optimization (SSAPO) algorithm. We introduce the basic experiment setup in this subsection (cf. Appendix G for more details). The settings are mostly consistent with the recent literature (Kim et al. [9]). Datasets. We used the UltraFeedback dataset [18], containing 60K samples. A seed of 2K human-labeled preferences (3.3% of total 60K data) was used for initial training. The rest (58K samples) were split into three subsets (8K, 20K, and 30K) for self-annotation in iterative stages. Models. We use the supervised fine-tuned Mistral-7B-0.1 [19] as the initial model π_init and LLaMA-3-8B for compatibility checks. All models are fine-tuned on UltraChat [20]. Evaluations. We use AlpacaEval 2.0 [14] for instruction-following tasks and MT-Bench [15] to evaluate multi-turn performance across tasks like math, coding, and writing. |
| Researcher Affiliation | Academia | Xu Chu (1, 2, 3), Zhixin Zhang (1, 3), Tianyu Jia (1, 3), Yujie Jin (1, 3). (1) Key Laboratory of High Confidence Software Technologies, Ministry of Education; (2) Center on Frontiers of Computing Studies, Peking University; (3) School of Computer Science, Peking University. Corresponding author. Contact E-mail: EMAIL |
| Pseudocode | Yes | Algorithm 1 Stackelberg Self-Annotated Preference Optimization (SSAPO) |
| Open Source Code | Yes | https://github.com/EunTilofy/SSAPO |
| Open Datasets | Yes | Datasets. We used the UltraFeedback dataset [18], containing 60K samples. |
| Dataset Splits | Yes | A seed of 2K human-labeled preferences (3.3% of total 60K data) was used for initial training. The rest (58K samples) were split into three subsets (8K, 20K, and 30K) for self-annotation in iterative stages. |
| Hardware Specification | Yes | For all experiments, we utilized 4 A800 GPUs. |
| Software Dependencies | No | The paper mentions using Sequential Least Squares Programming (SLSQP) for optimization but does not provide a specific version number for it, nor does it list versions for other key software components or libraries like Python or PyTorch. |
| Experiment Setup | Yes | Implementation. We initialize training with DPO on 2K seed samples, followed by 3 iterative stages of self-annotation. In each stage, new preferences are generated via a policy that ranks response pairs. A distributionally robust optimization (DRO) is performed using sequential least squares programming (SLSQP) to adjust the model based on adversarial shifts within a Wasserstein ball. The group size G for parallel computation is set to 100 unless otherwise specified. Hyper-parameters for Different LLMs. For Mistral-7B-0.1, we set learning rate = 5 × 10⁻⁷ and DPO hyper-parameter β = 0.1 throughout the entire preference learning process. We conduct 3 epochs for the initial DPO training and 3 iterations for SSAPO game play (leader-follower updates). For LLaMA-3-8B, we set learning rate = 1 × 10⁻⁶ and DPO hyper-parameter β = 0.05 throughout the entire preference learning process. We conduct 1 epoch for the initial DPO training and 2 iterations for SSAPO game play (leader-follower updates). |
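
The split sizes and hyper-parameters quoted in the table can be collected into a small configuration sketch, which may help when checking a reproduction attempt against the reported setup. This is only an illustration: the names `PreferenceConfig`, `CONFIGS`, and `split_indices` are hypothetical and do not come from the paper or its repository, and the contiguous index ranges are an assumption (the paper does not say how samples are assigned to each subset).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceConfig:
    """Hypothetical container for the hyper-parameters quoted in the table."""
    learning_rate: float
    beta: float               # DPO hyper-parameter
    initial_dpo_epochs: int   # epochs of DPO on the seed data
    ssapo_iterations: int     # leader-follower update rounds


# Values as reported in the "Experiment Setup" row.
CONFIGS = {
    "Mistral-7B-0.1": PreferenceConfig(5e-7, 0.1, 3, 3),
    "LLaMA-3-8B": PreferenceConfig(1e-6, 0.05, 1, 2),
}

TOTAL_SAMPLES = 60_000
SEED_SIZE = 2_000                       # human-labeled seed preferences
STAGE_SIZES = (8_000, 20_000, 30_000)   # self-annotation stages


def split_indices(total=TOTAL_SAMPLES, seed=SEED_SIZE, stage_sizes=STAGE_SIZES):
    """Return index ranges for the seed set followed by each stage.

    Assumption: contiguous ranges are used purely for illustration.
    """
    if seed + sum(stage_sizes) != total:
        raise ValueError("seed and stage sizes must add up to the total")
    bounds = [0, seed]
    for size in stage_sizes:
        bounds.append(bounds[-1] + size)
    return [range(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


seed_split, *stage_splits = split_indices()
seed_percent = round(100 * len(seed_split) / TOTAL_SAMPLES, 1)
```

The size check in `split_indices` confirms that the 2K seed set plus the 8K, 20K, and 30K subsets account for all 60K samples, and `seed_percent` reproduces the 3.3% figure quoted in the "Dataset Splits" row.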