Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Risk-Controlling Model Selection via Guided Bayesian Optimization

Authors: Bracha Laufer-Goldshtein, Adam Fisch, Regina Barzilay, Tommi Jaakkola

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach on a range of tasks with multiple desiderata, including low error rates, equitable predictions, handling spurious correlations, managing rate and distortion in generative models, and reducing computational costs. ... Through empirical experiments, we demonstrate that GuideBO selects highly efficient and verified configurations under practical budget constraints, outperforming baselines.
Researcher Affiliation | Collaboration | Bracha Laufer-Goldshtein EMAIL, Department of Electrical Engineering, Tel-Aviv University; Adam Fisch EMAIL, Google DeepMind; Regina Barzilay EMAIL, Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology
Pseudocode | Yes | Algorithm 1 GuideBO: Testing Guided Bayesian Optimization; Algorithm C.1 Configuration Selection
Open Source Code | Yes | Our code is available at https://github.com/bracha-laufer/guidebo.
Open Datasets | Yes | Fairness: We use the Adult (Dua et al., 2017) dataset... Robustness and selective classification: We use CelebA (Lin et al., 2019)... VAE: We use the MNIST dataset (LeCun, 1998)... Pruning: We use the AG News (Zhang et al., 2015) dataset... Early-Time Classification: We use the QuALITY dataset (Pang et al., 2022).
Dataset Splits | Yes | The selection of λ is carried out based on two disjoint data subsets: (i) a validation set Dval = {(Xi, Yi)}_{i=1}^{k} and (ii) a calibration set Dcal = {(Xi, Yi)}_{i=k+1}^{k+m}. ... The training data is used to learn the model parameters. The validation data is used for selecting candidate hyperparameter configurations... The calibration data is used for the FST procedure... Lastly, the performance of the selected model is assessed on the test dataset. ... Table 1: Datasets Details (lists Train, Validation, Calibration, and Test sample counts). ... Each dataset is partitioned into four distinct parts: 80% for training, with the remaining 20% divided equally among the validation, calibration, and test subsets.
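The quoted partitioning scheme (80% train, with the remaining 20% split equally into validation, calibration, and test) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, the use of shuffled indices, and the seed are assumptions.

```python
import random

def partition(indices, seed=0):
    """Illustrative sketch of the paper's described split:
    80% train, remaining 20% divided equally among
    validation, calibration, and test subsets."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_train = int(0.8 * len(idx))
    rest = idx[n_train:]
    third = len(rest) // 3  # equal thirds of the held-out 20%
    return {
        "train": idx[:n_train],
        "val": rest[:third],
        "cal": rest[third:2 * third],
        "test": rest[2 * third:],  # absorbs any rounding remainder
    }
```

The key property is that validation, calibration, and test are disjoint, matching the requirement that hyperparameter search (on Dval) and the FST procedure (on Dcal) never see the evaluation data.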
Hardware Specification | No | The paper mentions specific models such as "ResNet-50", "BERT-base", "Vicuna-13B", and "LSTM", but does not provide any hardware details such as GPU models, CPU types, or memory used to run these experiments.
Software Dependencies | No | The paper mentions the "Adam optimizer" and "AdamW optimizer" and uses the "SMAC3 implementation" for a baseline. However, it does not specify version numbers for these or any other key software libraries (e.g., Python, PyTorch, TensorFlow, scikit-learn), which are necessary for full reproducibility.
Experiment Setup | Yes | Our model is a 3-layer feed-forward neural network with hidden dimensions [60, 25]. We train all models using the Adam optimizer with learning rate 1e-3 for 50 epochs and batch size 256. ... We train the models for 50 epochs with SGD with a constant learning rate of 1e-3, momentum of 0.9, batch size 32, and weight decay of 1e-4. ... We set the learning rate to 1e-4 and the batch size to 64. The training process consisted of 10 epochs. ... A standard LSTM is used for feature extraction with one recurrent layer with a hidden size of 32, except for WalkingSittingStanding, where the model consists of 2 recurrent layers, each with a hidden size of 256. ... The models are trained with the Adam optimizer, with a learning rate of 0.001 and a batch size of 64.
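For concreteness, the first quoted architecture (a 3-layer feed-forward network with hidden dimensions [60, 25]) can be sketched as a forward pass. This is a minimal NumPy illustration, not the authors' implementation; the input/output dimensions, He initialization, and ReLU activations are assumptions not stated in the quote.

```python
import numpy as np

def make_mlp(in_dim, hidden=(60, 25), out_dim=2, seed=0):
    """Initialize a 3-layer feed-forward net with hidden dims [60, 25].
    in_dim, out_dim, and the init scheme are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, *hidden, out_dim]
    # He-style initialization for each (weight, bias) pair
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """ReLU on the two hidden layers; raw logits at the output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x
```

Training such a model with Adam at learning rate 1e-3, batch size 256, for 50 epochs would reproduce the quoted setup for the first task.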