Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scalable Exploration via Ensemble++

Authors: Yingru Li, Jiawei Xu, Baoxiang Wang, Zhiquan Luo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments across linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff versus state-of-the-art methods. Empirical Validation: Through comprehensive experiments on synthetic and real-world benchmarks, including quadratic bandits and large-scale neural tasks involving GPTs, we demonstrate that Ensemble++ achieves superior regret-vs-computation trade-offs compared to leading baselines such as Ensemble+ [Osband et al., 2018, 2019] and EpiNet [Osband et al., 2023a,b] (see Fig. 1 and Section 5); and validate the theoretical results of linear Ensemble++ sampling.
Researcher Affiliation	Academia	1 The Chinese University of Hong Kong, Shenzhen; 2 Shenzhen Research Institute of Big Data
Pseudocode	Yes	Algorithm 1 Ensemble++
Open Source Code	Yes	Code: https://github.com/szrlee/Ensemble_Plus_Plus.
Open Datasets	Yes	UCI Shuttle: Following Riquelme et al. [2018], Kveton et al. [2020b], we create contextual bandits for N-class classification using the UCI Shuttle dataset [Asuncion et al., 2007]. Online Hate Speech Detection: Built using a language dataset³. The agent decides to publish (reward 1 for 'free', -0.5 for 'hate') or block content (reward 0.5). Footnote 3: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech
Dataset Splits	No	The paper describes environments and tasks (e.g., Finite-action Linear Bandit, Quadratic Bandit, Neural Bandit, UCI Dataset, Online Hate Speech Detection) which typically involve sequential data generation rather than fixed train/test/validation splits. For example, for the UCI Dataset it describes how contexts are constructed but not how the overall dataset is split into training/testing subsets: 'UCI Dataset: Following prior works [Riquelme et al., 2018, Kveton et al., 2020b], we conduct contextual bandits for N-class classification using the UCI datasets [Asuncion et al., 2007] Mushroom and Shuttle. Specifically, given a data feature $x \in \mathbb{R}^d$ in the dataset, we construct context vectors for N arms, such as $x^{(1)} = (x, 0, \ldots, 0), \ldots, x^{(N)} = (0, \ldots, 0, x) \in \mathbb{R}^{Nd}$. Only the arm $x^{(j)}$ where j matches the correct class of this data x has a reward of 1, while all other arms have a reward of 0.' No explicit dataset split percentages or counts are provided for any task.
Hardware Specification	Yes	All experiments are conducted on P40 GPUs to maintain processing standardization. For the hate speech detection experiments with foundation models, the paper mentions that V100 GPUs were used.
Software Dependencies	No	The paper mentions models like 'GPT-2' and 'Pythia14m' and platforms like 'Hugging Face' for datasets, but it does not provide specific version numbers for programming languages, libraries, or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow, or CUDA versions).
Experiment Setup	Yes	For the practical implementation of Ensemble++, we utilize a 2-layer MLP with 64 units and ReLU activation to construct the feature extractor h(x; w). The ensemble size is set to M = 8, and we use a symmetrized slack variable β = 0.01 and weight decay λ = 0.01 across all nonlinear bandit tasks. During training, a unified batch size of 128 and a learning rate of 0.0001 are employed for all tasks.
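
The context construction quoted in the Dataset Splits row (one block-structured context vector per arm, with reward 1 only for the arm matching the true class) can be illustrated with a short sketch. This is not taken from the paper's repository: the function names `make_arm_contexts` and `reward`, the 0-based arm indexing, and the example feature vector are all assumptions made for illustration.

```python
from typing import List, Sequence


def make_arm_contexts(x: Sequence[float], n_arms: int) -> List[List[float]]:
    """Build one context vector of length n_arms * d for each arm.

    Arm j (0-indexed here, an assumption) holds x in its j-th block of
    size d and zeros everywhere else.
    """
    d = len(x)
    contexts = []
    for j in range(n_arms):
        ctx = [0.0] * (n_arms * d)
        ctx[j * d:(j + 1) * d] = [float(v) for v in x]
        contexts.append(ctx)
    return contexts


def reward(chosen_arm: int, true_class: int) -> float:
    """Return 1 only when the chosen arm matches the correct class, else 0."""
    return 1.0 if chosen_arm == true_class else 0.0


# Hypothetical data point with d = 2 features and N = 3 classes.
x = [0.5, -1.0]
contexts = make_arm_contexts(x, n_arms=3)
```

Each data point is therefore presented to the agent as N candidate contexts in dimension N·d, which is why the quoted setup involves sequential interaction rather than a fixed train/test split.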