Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CREAM: Consistency Regularized Self-Rewarding Language Models
Authors: Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations... With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at https://github.com/Raibows/CREAM. Table 1: Main results of each method on test sets of downstream tasks. |
| Researcher Affiliation | Collaboration | 1University of North Carolina at Chapel Hill 2Nanyang Technological University 3National University of Singapore 4Microsoft Research |
| Pseudocode | Yes | Algorithm 1 Consistency-Regularized Self-Rewarding Language Model |
| Open Source Code | Yes | The code is publicly available at https://github.com/Raibows/CREAM. |
| Open Datasets | Yes | In our experiments, we use Open Assistant dataset (Köpf et al., 2024) and only reserve about 3.4K human-annotated examples as the seed SFT data DS. To construct the unlabeled prompt dataset DU, we mix prompts of DS with the train split of each downstream task including (1) ARC-Easy/Challenge (Clark et al., 2018), (2) OpenBookQA (Mihaylov et al., 2018), (3) SIQA (Sap et al., 2019), and (4) GSM8K (Cobbe et al., 2021). |
| Dataset Splits | No | The paper mentions using 'train split of each downstream task' for prompt generation and evaluating on 'test sets of downstream tasks' (Table 1), and provides absolute counts for initial SFT data (3.4K) and unlabeled prompt data (21K). However, it does not explicitly provide percentages or sample counts for the training, validation, and testing splits for its own iterative preference data generation and training process, which is crucial for full reproducibility of their DPO runs. |
| Hardware Specification | No | The paper mentions models were trained on 'two LLMs with about 7B parameters' and cites 'limited computational resources' but does not specify any particular hardware like GPU models, CPU types, or cloud platforms used for the experiments. |
| Software Dependencies | No | The paper only mentions the 'AdamW optimizer (Loshchilov & Hutter, 2019)' but does not list specific versions for programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries. |
| Experiment Setup | Yes | In our experiments, we fine-tune the initial model (M0) on the seed SFT data for 3 epochs with a learning rate of 1e-6, resulting in model M1. Following the SRLM approach, we then iteratively fine-tune the model using the preference learning objective for two additional iterations, producing models M2 and M3. In the preference training of each iteration, we set β = 0.1 of DPO, and fine-tune the model for 1 epoch with a learning rate of 1e-6. All training processes use the AdamW optimizer (Loshchilov & Hutter, 2019) with a warmup ratio of 0.1. For the response sampling stage of all SRLM methods, we use a decoding temperature of 0.8 and generate N = 5 responses per prompt. For evaluating downstream tasks, we use greedy decoding to generate answers. |
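For readers sketching a replication, the hyperparameters quoted in the Experiment Setup row can be collected into one place. The configuration below is an illustrative sketch: the dictionary and field names are our own (the paper's code at the linked repository may organize these differently), and only the values are taken from the excerpt above.

```python
# Replication sketch of the CREAM training setup as reported in the paper.
# Field names are illustrative assumptions; values come from the quoted excerpt.

SFT_CONFIG = {
    "epochs": 3,             # seed SFT fine-tuning (M0 -> M1)
    "learning_rate": 1e-6,
}

PREFERENCE_CONFIG = {
    "beta": 0.1,             # DPO beta
    "epochs_per_iteration": 1,
    "iterations": 2,         # M1 -> M2 -> M3
    "learning_rate": 1e-6,
    "optimizer": "AdamW",    # Loshchilov & Hutter, 2019
    "warmup_ratio": 0.1,
}

SAMPLING_CONFIG = {
    "temperature": 0.8,      # response sampling for preference data
    "responses_per_prompt": 5,  # N = 5
}

EVAL_CONFIG = {
    "decoding": "greedy",    # downstream task evaluation
}

if __name__ == "__main__":
    print(PREFERENCE_CONFIG["beta"])
```

Note that hardware and software versions remain unspecified (see the rows above), so even with these values a bit-exact reproduction of the DPO runs is not guaranteed.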