Deep Bayesian Active Learning for Preference Modeling in Large Language Models

Authors: Luckeciano Carvalho Melo, Panagiotis Tigas, Alessandro Abate, Yarin Gal

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct active learning experiments in the Reddit and CNN/DM preference datasets [25, 26, 1] to validate our method.
Researcher Affiliation | Academia | (1) OATML, University of Oxford; (2) OXCAV, University of Oxford
Pseudocode | Yes | Algorithm 1 BAL-PM (a sketch of the acquisition rule follows the table).
Open Source Code | Yes | To ensure the reproducibility of our research findings, we release our code at https://github.com/luckeciano/BAL-PM.
Open Datasets | Yes | We considered prompts from the Reddit TL;DR dataset of Reddit posts [25] and the CNN/DM News dataset [26]. We leverage the generated completions and human feedback collected by Stiennon et al. [1].
Dataset Splits | Yes | The Reddit dataset contains train/eval/test splits, and we adopt the train split (92,858 points) for the pool and training sets, the eval split (33,083 points) for validation, and report results in the test set (50,719 points). (A loading sketch follows the table.)
Hardware Specification | Yes | We execute all active learning experiments in a single A100 GPU, and each experiment takes approximately one day. For the base LLM feature generation, we also use a single A100 GPU...
Software Dependencies | No | Our implementation is based on PyTorch [58] and Hugging Face [59]. The paper names the frameworks it builds on but does not provide version numbers for them. (A version-logging snippet follows the table.)
Experiment Setup | Yes | In Table 2, we share all hyperparameters used in this work. We specifically performed a hyperparameter search on the entropy term parameters and baselines. The search strategy was a simple linear search on the options in Table 1, considering each parameter in isolation. The selection followed the final performance on a held-out validation set. For baselines, we mostly considered the values presented in prior work [22]. For the proposed method, we also considered d_X as a hyperparameter and found that smaller values often work better than using the dimensionality of the base LLM embeddings. (A sketch of this search follows the table.)
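
The Pseudocode row above points to Algorithm 1 (BAL-PM). Below is a minimal sketch of the acquisition rule it describes: score each pool point by the ensemble's epistemic uncertainty over the preference label plus an entropy bonus that favors prompts in sparsely covered regions of the LLM feature space. Everything here (function names, the kNN-distance surrogate for the entropy term, the `beta` weight) is an illustrative assumption, not the released implementation.

```python
import torch


def bald_score(probs: torch.Tensor) -> torch.Tensor:
    """Epistemic uncertainty (BALD-style mutual information) of an ensemble of
    preference heads. probs: [n_members, n_pool, 2] class probabilities."""
    mean_p = probs.mean(dim=0)                                           # [n_pool, 2]
    entropy_of_mean = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)
    mean_of_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)
    return entropy_of_mean - mean_of_entropy                             # [n_pool]


def entropy_bonus(pool_feats: torch.Tensor, acquired_feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Surrogate for how much a candidate prompt would increase the entropy of
    the acquired prompt distribution: distance to its k-th nearest acquired prompt."""
    if acquired_feats.shape[0] == 0:
        return torch.zeros(pool_feats.shape[0])
    dists = torch.cdist(pool_feats, acquired_feats)                      # [n_pool, n_acquired]
    k = min(k, acquired_feats.shape[0])
    kth = dists.topk(k, largest=False).values[:, -1]                     # k-th nearest neighbor distance
    return kth.clamp_min(1e-12).log()


def balpm_acquire(probs, pool_feats, acquired_feats, beta: float = 1.0) -> int:
    """Return the pool index maximizing uncertainty + beta * entropy bonus."""
    scores = bald_score(probs) + beta * entropy_bonus(pool_feats, acquired_feats)
    return int(scores.argmax())
```

The k-th-neighbor distance is only a simplified stand-in that preserves the "prefer sparse regions" behaviour; the exact entropy estimator is defined in Algorithm 1 and the released code.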
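
For the Open Datasets and Dataset Splits rows, the human feedback collected by Stiennon et al. [1] is publicly available on the Hugging Face Hub. A hedged loading sketch follows; the dataset id is assumed to be the public release, and the exact filtering that yields the reported 92,858 / 33,083 / 50,719 split sizes is defined in the authors' released code rather than here.

```python
from datasets import load_dataset

# Human preference comparisons released by Stiennon et al. (TL;DR + CNN/DM).
# Dataset id assumed to be the public "openai/summarize_from_feedback" release.
comparisons = load_dataset("openai/summarize_from_feedback", "comparisons")

train_pool = comparisons["train"]        # pool + training prompts in the paper's setup
held_out = comparisons["validation"]     # source of the validation/test prompts

print(len(train_pool), len(held_out))
```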
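
Because the Software Dependencies row flags the missing version numbers, anyone reproducing the experiments has to pin their own environment. A small snippet like the one below records what was actually installed; the package names are assumptions based on the paper's mention of PyTorch and Hugging Face.

```python
import importlib.metadata as md

# Record the versions actually used, since the paper does not pin any.
for pkg in ("torch", "transformers", "datasets"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```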
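
The Experiment Setup row describes a simple linear search that varies each hyperparameter in isolation and selects by held-out validation performance. The sketch below follows that description; the option grids and the `train_and_evaluate` function are placeholders, not the values from the paper's Table 1.

```python
from typing import Any, Callable, Dict, List


def linear_search(
    defaults: Dict[str, Any],
    options: Dict[str, List[Any]],
    train_and_evaluate: Callable[[Dict[str, Any]], float],
) -> Dict[str, Any]:
    """Vary one hyperparameter at a time (all others at their defaults) and keep
    the value with the best validation score."""
    best = dict(defaults)
    for name, candidates in options.items():
        scores = {value: train_and_evaluate({**defaults, name: value}) for value in candidates}
        best[name] = max(scores, key=scores.get)
    return best


# Placeholder grids: beta weights the entropy term, d_x is the reduced feature
# dimensionality mentioned above; the values are illustrative only.
options = {"beta": [0.1, 1.0, 10.0], "d_x": [16, 64, 256]}
defaults = {"beta": 1.0, "d_x": 64}
```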