Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Test-time Collective Prediction
Authors: Celestine Mendler-Dünner, Wenshuo Guo, Stephen Bates, Michael Jordan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the empirical side, we demonstrate the efficacy of our mechanism through extensive numerical experiments across different learning scenarios. In particular, we illustrate the mechanism's advantages over model averaging as well as model selection, and demonstrate that it consistently outperforms alternative non-uniform combination schemes that have access to additional validation data across a wide variety of models and datasets. |
| Researcher Affiliation | Academia | Celestine Mendler-Dünner MPI for Intelligent Systems, Tübingen EMAIL Wenshuo Guo University of California, Berkeley EMAIL Stephen Bates University of California, Berkeley EMAIL Michael I. Jordan University of California, Berkeley EMAIL |
| Pseudocode | Yes | Algorithm 1 DeGroot Aggregation |
| Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing the source code for the work described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We work with the abalone dataset [Nash et al., 1994]... Datasets have been downloaded from [Fan, 2011]. |
| Dataset Splits | No | The paper describes how individual agents use local data for validation (e.g., 'Construct local validation dataset Di(x) using N-nearest neighbors of x in Di.') and compares against methods using 'additional validation data', but it does not provide specific, global train/validation/test dataset splits (e.g., percentages or exact counts) for the overall experiments needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions software like 'scikit-learn' and 'Python' but does not specify version numbers for these or other key software components used in the experiments. |
| Experiment Setup | Yes | Unless stated otherwise, we use K = 5 agents and let each agent fit a linear model to her local data... We use N = 5 for local cross-validation in DeGroot... For our first experiment... we train a lasso model on each agent with regularization parameter λk = λ that achieves a sparsity of 0.8... We choose N to be 1% of the data partition for all schemes (with a hard lower bound at 2). |
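
The setup quoted in the table (K = 5 agents, one linear model per agent, local validation on the N nearest neighbors of a test point, N set to 1% of the partition with a lower bound of 2) can be illustrated with a minimal sketch. Several parts are assumptions rather than details taken from this page: the synthetic data, the helper names, the inverse-MSE trust weights, and the repeated-averaging loop used to reach consensus. It is not the authors' implementation, which this page reports as not released.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear fit with an intercept term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    """Apply a model returned by fit_linear to the rows of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def neighborhood_size(n_local, fraction=0.01, lower=2):
    """N as 1% of the local partition, never below 2 (as quoted in the table)."""
    return max(lower, int(fraction * n_local))

def degroot_predict(x, parts, models, n_iter=200, eps=1e-12):
    """Hypothetical helper: pool the agents' predictions for one test point x."""
    K = len(parts)
    trust = np.zeros((K, K))
    for i, (Xi, yi) in enumerate(parts):
        # Local validation set: the N nearest neighbors of x in agent i's data.
        N = neighborhood_size(len(yi))
        idx = np.argsort(np.linalg.norm(Xi - x, axis=1))[:N]
        for j, w in enumerate(models):
            mse = np.mean((predict(w, Xi[idx]) - yi[idx]) ** 2)
            trust[i, j] = 1.0 / (mse + eps)  # assumption: inverse-MSE trust
    trust /= trust.sum(axis=1, keepdims=True)  # each row sums to 1
    # Every agent starts from its own prediction, then repeatedly replaces it
    # with the trust-weighted average of all current predictions.
    beliefs = np.array([predict(w, x[None, :])[0] for w in models])
    for _ in range(n_iter):
        beliefs = trust @ beliefs
    return beliefs, trust

# Synthetic stand-in for a real dataset, split evenly across K = 5 agents.
rng = np.random.default_rng(0)
K = 5
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
parts = [(X[k::K], y[k::K]) for k in range(K)]
models = [fit_linear(Xk, yk) for Xk, yk in parts]

consensus, trust = degroot_predict(X[0], parts, models)
print(consensus)  # after enough rounds, all K entries should agree
```

Because every trust weight is strictly positive and each row sums to one, the repeated averaging settles on a single value that lies between the smallest and largest individual predictions.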