Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Latent Principle Discovery for Language Model Self-Improvement

Authors: Keshav Ramji, Tahira Naseem, Ramón Astudillo

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We validate the efficacy of our method over several iterations on instruction-following benchmarks including MT-Bench (Zheng et al., 2023) and AlpacaEval (Li et al., 2023), and leverage Prometheus-v2.0 (Kim et al., 2024) to analyze win-rates with fine-grained, principle-following rubrics.
Researcher Affiliation	Industry	Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo; IBM Research AI
Pseudocode	Yes	A Formal Description of STaPLe Algorithm. We provide a full, formal description of the STaPLe algorithm below. We use y1 and y2 notationally to avoid confusion with the sample indices. We use general variables for components which may be ablated on: the similarity function f, clustering algorithm C and label replacement scheme R. We leave the M-step in terms of the dataset D for generality, although if clustering were to be performed, one would use D̃ instead. Algorithm 1: Self-Taught Principle Learning (STaPLe)
Open Source Code	No	We also will publicly release the code for the STaPLe algorithm, to further facilitate reproducibility of our self-improvement method.
Open Datasets	Yes	We form a corpus of 100k samples for the principle discovery phase, consisting of four datasets: Anthropic HH-RLHF (Bai et al., 2022a), UltraFeedback (Cui et al., 2024), TL;DR (Stiennon et al., 2020), and HotpotQA (Yang et al., 2018), taken in equal proportion (i.e. 25k samples of each dataset, drawn randomly) and deduplicated by prompt. For all datasets, we use the existing, publicly-available human-annotated gold responses; for preference datasets, we take the chosen response to be the gold answer y^G. ... We evaluate on the MT-Bench (Zheng et al., 2023) and AlpacaEval-2.0-LC (Li et al., 2023; Dubois et al., 2024) datasets, instruction-following evaluations designed to reflect alignment abilities of LLMs in chat settings; these are scored using the GPT-4o model (OpenAI, 2024). We also use the Prometheus-8x7B-v2.0 model (Kim et al., 2024) on the IFEval (Zhou et al., 2023) dataset, for fine-grained evaluation on principle-following rubrics, with additional experiments in Appendix I.
Dataset Splits	Yes	To run STaPLe, we use the first 50k samples for iteration 1, to heavily bootstrap off the first iteration, and then use 10k samples for each iteration thereafter, such that the input prompts are unseen for each iteration. ... Table 2: Self-improvement over four iterations of the STaPLe algorithm, compared against the STaR baseline (SFT without the principle operating as a latent CoT between the initial and refined attempts). Note that the SFT sample counts for iterations 2-4 differ as the principles are discovered by different models – the STaR Iter 1 and STaPLe Iter 1 models, respectively. Numbers in parentheses denote training set size, based on the number of samples which successfully refined.
Hardware Specification	Yes	We use 4 H100 Nvidia GPUs for the principle discovery phase, with a separate vLLM (Kwon et al., 2023) instance per GPU. ... All experiments were performed on 8 H100 Nvidia GPUs.
Software Dependencies	No	We use the all-MiniLM-L6-v2 model (Sentence Transformers, 2021) from Sentence Transformers as the embedding model to compute medoids in our clustering approach. ... We use the scikit-optimize package (Head et al., 2020) to perform Bayesian optimization via Gaussian Processes to search for an optimal value of δ over this function.
Experiment Setup	Yes	We use a Rouge-L F1 threshold of 0.4 for the similarity threshold (f(y, y^G); if the initial response exceeds this threshold, we do not pursue refinement). ... We set N = 16 to balance runtime per iteration of the algorithm with sufficient exploration of diverse principles. During principle discovery, we sample principles, critiques, and responses at a temperature of 0.7; the maximum number of tokens for principle proposal and critique is set at 500, and is set at 1024 for the refined response. ... We perform full supervised fine-tuning for 3 epochs at a learning rate of 1 × 10^-6 with the AdamW optimizer (Loshchilov and Hutter, 2019), with a sequence length of 4096.
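
The similarity gate quoted under Experiment Setup can be illustrated with a minimal sketch. It assumes whitespace tokenization and a plain longest-common-subsequence Rouge-L F1; the paper's exact Rouge implementation is not given in the excerpt, and the function names and the handling of a score exactly equal to 0.4 are hypothetical choices made here for illustration.

```python
# Sketch only: whitespace tokenization and this LCS-based Rouge-L F1 are
# assumptions; the excerpt does not say which Rouge implementation was used.

SIMILARITY_THRESHOLD = 0.4  # Rouge-L F1 threshold quoted in the table


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Rouge-L F1 between two strings, using lowercase whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def needs_refinement(initial_response, gold_response,
                     threshold=SIMILARITY_THRESHOLD):
    """Hypothetical helper: refine only if the score does not exceed the threshold."""
    return rouge_l_f1(initial_response, gold_response) <= threshold
```

Under these assumptions, a response identical to the gold answer scores 1.0 and is skipped, while a response sharing no tokens with it scores 0.0 and would be sent on to refinement.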
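
The per-iteration sample budget quoted under Dataset Splits (50k samples for iteration 1, then 10k for each later iteration, over the four iterations reported in Table 2) can be sketched as index ranges over the 100k-sample corpus. Treating the later iterations as contiguous, non-overlapping slices is an assumption; the excerpt only states that prompts are unseen in each iteration. The function name `iteration_slices` is hypothetical.

```python
# Sketch only: contiguous slicing is an assumption. The excerpt states the
# sizes (50k, then 10k per iteration) but not how later samples are selected.

def iteration_slices(first_iter_size=50_000, later_iter_size=10_000,
                     num_iterations=4):
    """Return one (start, end) index pair per iteration, with no overlap."""
    slices = []
    start = 0
    for iteration in range(1, num_iterations + 1):
        size = first_iter_size if iteration == 1 else later_iter_size
        slices.append((start, start + size))
        start += size
    return slices


if __name__ == "__main__":
    for number, (start, end) in enumerate(iteration_slices(), start=1):
        print(f"iteration {number}: samples [{start}, {end})")
```

With the quoted sizes, four iterations consume 80k of the 100k samples, so no prompt needs to be reused across iterations.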
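
The Software Dependencies row mentions computing medoids from sentence embeddings. As a rough illustration of what a medoid is, the sketch below picks the member of one cluster with the smallest total distance to the other members. Cosine distance, the helper names, and the toy two-dimensional vectors are assumptions for illustration; the excerpt does not state the distance metric, and real inputs would be embeddings produced by all-MiniLM-L6-v2.

```python
import math

# Sketch only: cosine distance is an assumed metric, and the vectors passed in
# stand in for sentence embeddings of principles within a single cluster.


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def medoid_index(embeddings):
    """Index of the vector with the smallest summed distance to all others."""
    best_index, best_total = 0, float("inf")
    for i, u in enumerate(embeddings):
        total = sum(cosine_distance(u, v) for v in embeddings)
        if total < best_total:
            best_index, best_total = i, total
    return best_index
```

Unlike a centroid, a medoid is always an actual member of the cluster, so it corresponds to a concrete principle rather than an averaged vector.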