Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Contextual Bandit Bake-off

Authors: Alberto Bietti, Alekh Agarwal, John Langford

JMLR 2021

Each entry below gives the reproducibility variable, the classified result, and the LLM response used as evidence.
Variable: Research Type
Result: Experimental
LLM Response: We leverage the availability of large numbers of supervised learning datasets to empirically evaluate contextual bandit algorithms, focusing on practical methods that learn by relying on optimization oracles from supervised learning. ... The main objective of our work is an evaluation of practical methods that are relevant to practitioners.
Variable: Researcher Affiliation
Result: Collaboration
LLM Response: Alberto Bietti (Center for Data Science, New York University, New York, NY); Alekh Agarwal (Microsoft Research, Redmond, WA); John Langford (Microsoft Research, New York, NY)
Variable: Pseudocode
Result: Yes
LLM Response: Algorithm 1: Generic contextual bandit algorithm; Algorithm 2: ϵ-greedy; Algorithm 3: Bag / Online BTS; Algorithm 4: Cover; Algorithm 5: RegCB
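Of the listed algorithms, ϵ-greedy is the simplest: explore uniformly at random with probability ϵ, otherwise play the action the current policy predicts has lowest cost. A minimal sketch of that action-selection step (function and variable names are illustrative, not the paper's implementation):

```python
import random

def epsilon_greedy_action(predicted_costs, epsilon, rng=random):
    """Return an action index for one round of ϵ-greedy.

    predicted_costs: per-action cost estimates from the current policy.
    epsilon: probability of exploring uniformly at random.
    """
    k = len(predicted_costs)
    if rng.random() < epsilon:
        return rng.randrange(k)          # explore: uniform over actions
    return min(range(k), key=lambda a: predicted_costs[a])  # exploit
```

With ϵ = 0 this reduces to pure exploitation of the cost predictor; the paper's other methods (Bag, Cover, RegCB) replace this selection rule with richer exploration strategies.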
Variable: Open Source Code
Result: Yes
LLM Response: The evaluation code is available at https://github.com/albietz/cb_bakeoff. All methods presented in this section are available in Vowpal Wabbit. For reproducibility purposes, the precise version of VW used to run these experiments is available at https://github.com/albietz/vowpal_wabbit/tree/bakeoff.
Variable: Open Datasets
Result: Yes
LLM Response: We consider a large collection of over 500 datasets with varying characteristics and various cost structures, including multiclass, multilabel and more general cost-sensitive datasets with real-valued costs. ... We consider a collection of 516 multiclass classification datasets from the openml.org platform... We consider 5 multilabel datasets from the LibSVM website... Microsoft Learning to Rank dataset, variant MSLR-30K at https://www.microsoft.com/en-us/research/project/mslr/, and the Yahoo! Learning to Rank Challenge V2.0, variant C14B at https://webscope.sandbox.yahoo.com/catalog.php?datatype=c. The datasets we used can be accessed at https://www.openml.org/d/<id>, with id in the following list: [list of IDs]
Variable: Dataset Splits
Result: No
LLM Response: Because of the online setup, we consider one or more fixed, shuffled orderings of each dataset. The datasets widely vary in noise levels, and number of actions, features, examples etc., allowing us to model varying difficulties in CB problems. ... The performance of method A on a dataset of size n is measured by the progressive validation loss (Blum et al., 1999): (1/n) Σ_{t=1}^{n} c_t(a_t), where a_t is the action chosen by the algorithm on the t-th example, and c_t the true cost vector.
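The progressive validation loss is simply the average cost of the actions the learner actually chose while streaming through the (shuffled) dataset. A minimal sketch, with illustrative names:

```python
def progressive_validation_loss(cost_vectors, chosen_actions):
    """Progressive validation loss (Blum et al., 1999):
    (1/n) * sum over t of c_t(a_t), where a_t is the action the
    algorithm chose on the t-th example and c_t its true cost vector."""
    n = len(chosen_actions)
    return sum(c[a] for c, a in zip(cost_vectors, chosen_actions)) / n
```

Because each a_t is chosen before the learner sees c_t, this running average behaves like a held-out estimate without needing an explicit train/test split, which is why the paper reports no dataset splits.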
Variable: Hardware Specification
Result: No
LLM Response: The paper does not mention the specific hardware (e.g., CPU or GPU models, memory) used to run the experiments. It refers to Vowpal Wabbit as an online learning system and discusses online learning in production systems, but does not describe the experimental hardware.
Variable: Software Dependencies
Result: No
LLM Response: All of our experiments are based on the online learning system Vowpal Wabbit which has already been successfully used in production systems (Agarwal et al., 2016). ... For reproducibility purposes, the precise version of VW used to run these experiments is available at https://github.com/albietz/vowpal_wabbit/tree/bakeoff. While a link to a specific branch of Vowpal Wabbit is provided, the paper does not explicitly state a version number (e.g., 8.9.0) for Vowpal Wabbit or any other ancillary software dependencies in the text.
Variable: Experiment Setup
Result: Yes
LLM Response: We ran each method on every dataset with different choices of algorithm-specific hyperparameters, learning rates, reductions, and loss encodings. Details are given in Appendix C.1 (Algorithms and Hyperparameters): We ran each method on every dataset with the following hyperparameters: algorithm-specific hyperparameters, shown in Table 9; 9 choices of learning rates, on a logarithmic grid from 0.001 to 10; 3 choices of reductions: IPS, DR and IWR; 3 choices of loss encodings: 0/1, -1/0 and 9/10 (see Eq. (7)).
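The learning-rate grid described above (9 points, logarithmically spaced from 0.001 to 10) can be generated with a short helper; the function name is illustrative and not from the paper's code:

```python
def log_grid(lo, hi, n):
    """Return n geometrically spaced values from lo to hi inclusive,
    i.e. evenly spaced on a log scale."""
    ratio = (hi / lo) ** (1 / (n - 1))
    return [lo * ratio**i for i in range(n)]

# The paper's grid: 0.001, ~0.00316, 0.01, ..., ~3.16, 10
learning_rates = log_grid(1e-3, 10.0, 9)
```

Each step multiplies by (10 / 0.001)^(1/8) = 10^0.5 ≈ 3.16, so every other grid point lands on a power of ten.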