Aligning Language Models with Human Preferences via a Bayesian Approach
Authors: Jiashuo Wang, Haozhao Wang, Shichao Sun, Wenjie Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two human-centric NLG tasks, i.e., emotional support conversation and integrity Rule-of-Thumb generation, show that our method consistently exceeds previous SOTA models in both automatic and human evaluations. |
| Researcher Affiliation | Academia | 1Department of Computing, The Hong Kong Polytechnic University 2School of Computer Science and Technology, Huazhong University of Science and Technology |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are released at https://github.com/wangjs9/Aligned-dPM. |
| Open Datasets | Yes | Dataset and Base Models The benchmark ESConv [22], containing approximately 1k conversations with 31k utterances... We derive human preferences from the Motivational Interviewing Dataset [36]... The MIC dataset [42] comprises about 99k distinct RoTs... |
| Dataset Splits | Yes | The dataset was randomly split into a 9:1 ratio for the training and validation set. |
| Hardware Specification | Yes | We trained models based on MultiESC using two NVIDIA RTX 3090 GPUs, while all other models, including d-PM models, were trained using a single NVIDIA RTX 3090 GPU. |
| Software Dependencies | Yes | Our models were implemented in Python using PyTorch and the transformers (4.16.2) library. |
| Experiment Setup | Yes | When training the aligned models, we aim to retain the same hyperparameters used in the training of the base models. We set the candidate number K to 10. We train each aligned model five times with five different seeds. Subsequently, we test each of the five trained models on the test dataset and compute the average results. ... We set the learning rate to 1 × 10⁻³ for the Blender-Vanilla and Blender-Joint base models, and 3 × 10⁻⁵ for the other models. Additionally, due to GPU memory constraints, we reduced the batch size from 32 to 12 when training the aligned MultiESC. ... The prefix length was set to 10. A batch size of 160 and a learning rate of 5 × 10⁻⁴ were used. |
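
For reference, below is a minimal sketch of the 9:1 random train/validation split described in the Dataset Splits row. The `split_train_val` helper and its arguments are hypothetical illustrations, not the authors' released code.

```python
import random

def split_train_val(examples, val_ratio=0.1, seed=42):
    """Randomly split examples into train/validation sets at a 9:1 ratio."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the original ordering is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]

# Usage: train_set, val_set = split_train_val(list_of_conversations)
```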
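
The Experiment Setup row reports training each aligned model five times with different random seeds and averaging the test results. The sketch below illustrates that protocol under stated assumptions: the hyperparameter values are taken from the quoted setup, but the seed values, `train_aligned_model`, and `evaluate` are hypothetical placeholders rather than the paper's implementation.

```python
import statistics

# Hyperparameters quoted in the paper's experiment setup.
CONFIG = {
    "candidate_number_K": 10,
    "lr_blender": 1e-3,          # Blender-Vanilla / Blender-Joint base models
    "lr_other": 3e-5,            # all other models
    "batch_size_multiesc": 12,   # reduced from 32 due to GPU memory constraints
    "seeds": [0, 1, 2, 3, 4],    # five seeds; actual values are not specified
}

def run_protocol(train_aligned_model, evaluate, test_set):
    """Train one model per seed, evaluate each on the test set, return the mean score."""
    scores = []
    for seed in CONFIG["seeds"]:
        model = train_aligned_model(seed=seed, config=CONFIG)
        scores.append(evaluate(model, test_set))
    return statistics.mean(scores)
```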