Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CREAM: Consistency Regularized Self-Rewarding Language Models
Authors: Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations... With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at https://github.com/Raibows/CREAM. Table 1: Main results of each method on test sets of downstream tasks. |
| Researcher Affiliation | Collaboration | 1University of North Carolina at Chapel Hill 2Nanyang Technological University 3National University of Singapore 4Microsoft Research |
| Pseudocode | Yes | Algorithm 1 Consistency-Regularized Self-Rewarding Language Model |
| Open Source Code | Yes | The code is publicly available at https://github.com/Raibows/CREAM. |
| Open Datasets | Yes | In our experiments, we use Open Assistant dataset (Köpf et al., 2024) and only reserve about 3.4K human-annotated examples as the seed SFT data DS. To construct the unlabeled prompt dataset DU, we mix prompts of DS with the train split of each downstream task including (1) ARC-Easy/Challenge (Clark et al., 2018), (2) OpenBookQA (Mihaylov et al., 2018), (3) SIQA (Sap et al., 2019), and (4) GSM8K (Cobbe et al., 2021). |
| Dataset Splits | No | The paper mentions using 'train split of each downstream task' for prompt generation and evaluating on 'test sets of downstream tasks' (Table 1), and provides absolute counts for initial SFT data (3.4K) and unlabeled prompt data (21K). However, it does not explicitly provide percentages or sample counts for the training, validation, and testing splits for its own iterative preference data generation and training process, which is crucial for full reproducibility of their DPO runs. |
| Hardware Specification | No | The paper mentions models were trained on 'two LLMs with about 7B parameters' and cites 'limited computational resources' but does not specify any particular hardware like GPU models, CPU types, or cloud platforms used for the experiments. |
| Software Dependencies | No | The paper only mentions the 'AdamW optimizer (Loshchilov & Hutter, 2019)' but does not list specific versions for programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries. |
| Experiment Setup | Yes | In our experiments, we fine-tune the initial model (M0) on the seed SFT data for 3 epochs with a learning rate of 1e-6, resulting in model M1. Following the SRLM approach, we then iteratively fine-tune the model using the preference learning objective for two additional iterations, producing models M2 and M3. In the preference training of each iteration, we set β = 0.1 of DPO, and fine-tune the model for 1 epoch with a learning rate of 1e-6. All training processes use the AdamW optimizer (Loshchilov & Hutter, 2019) with a warmup ratio of 0.1. For the response sampling stage of all SRLM methods, we use a decoding temperature of 0.8 and generate N = 5 responses per prompt. For evaluating downstream tasks, we use greedy decoding to generate answers. |
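For readers sketching a replication, the hyperparameters quoted in the Experiment Setup row can be collected into one place. The configuration below is an illustrative sketch: the dictionary and field names are our own (the paper's code at the linked repository may organize these differently), and only the values are taken from the excerpt above.

```python
# Replication sketch of the CREAM training setup as reported in the paper.
# Field names are illustrative assumptions; values come from the quoted excerpt.

SFT_CONFIG = {
    "epochs": 3,             # seed SFT fine-tuning (M0 -> M1)
    "learning_rate": 1e-6,
}

PREFERENCE_CONFIG = {
    "beta": 0.1,             # DPO beta
    "epochs_per_iteration": 1,
    "iterations": 2,         # M1 -> M2 -> M3
    "learning_rate": 1e-6,
    "optimizer": "AdamW",    # Loshchilov & Hutter, 2019
    "warmup_ratio": 0.1,
}

SAMPLING_CONFIG = {
    "temperature": 0.8,      # response sampling for preference data
    "responses_per_prompt": 5,  # N = 5
}

EVAL_CONFIG = {
    "decoding": "greedy",    # downstream task evaluation
}

if __name__ == "__main__":
    print(PREFERENCE_CONFIG["beta"])
```

Note that hardware and software versions remain unspecified (see the rows above), so even with these values a bit-exact reproduction of the DPO runs is not guaranteed.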