Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Contextual Active Model Selection

Authors: Xuefeng Liu, Fangfang Xia, Rick Stevens, Yuxin Chen

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate the effectiveness and robustness of our approach on a variety of online model selection tasks spanning different application domains (from generic ML benchmarks such as CIFAR10 to domain-specific tasks in biomedical analysis), data scales (ranging from 80 to 10K), data modalities (i.e., tabular, image, and graph-based data), and label types (binary or multiclass labels). For the tasks evaluated, (1) CAMS outperforms all competing baselines by a significant margin.
Researcher Affiliation | Academia | Xuefeng Liu¹, Fangfang Xia², Rick L. Stevens¹,², Yuxin Chen¹; ¹Department of Computer Science, University of Chicago; ²Argonne National Laboratory
Pseudocode | Yes | Figure 1: The Contextual Active Model Selection (CAMS) algorithm
Open Source Code | Yes | We provide the code and data in the supplementary material with a readme.txt for reproducing the results. Experiment details are listed in Section 6 and Appendix G, D.6. (from NeurIPS Paper Checklist, Section 5)
Open Datasets | Yes | Datasets. We evaluate our approach using five datasets: (1) CIFAR10 [41]... (2) DRIFT [73]... (3) VERTEBRAL [5]... (4) HIV [74]... (5) CovType [24]...
Dataset Splits | No | The paper mentions training and test sets but does not specify explicit validation set splits (e.g., percentages or counts) or a distinct validation phase with defined splits for hyperparameter tuning in the main experimental setup. It mentions 'randomly selected stream-size aligned data from testing-set' for online streaming.
Hardware Specification | Yes | We performed our experiments on a Linux server with 80 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz and total 528 Gigabyte memory.
Software Dependencies | No | The paper mentions software like 'VGG', 'ResNet', 'DenseNet', 'scikit-learn built-in models', but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We set 100 realizations and 3000 stream-size for DRIFT, 20 realizations and 10000 stream-size for CIFAR10, 200 realizations and 4000 stream size for HIV, 300 realization and 80 stream-size for VERTEBRAL. In each realization, we randomly selected stream-size aligned data from testing-set and make it as online streaming data which is the input of each algorithm.
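
The sampling procedure quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_streams`, the fixed seed, the choice to sample without replacement, and the test-set size of 100 used in the example are all assumptions made for illustration. Only the realization counts and stream sizes come from the quoted text.

```python
import random

# (realizations, stream size) per dataset, as quoted in the Experiment Setup row.
SETUP = {
    "DRIFT": (100, 3000),
    "CIFAR10": (20, 10000),
    "HIV": (200, 4000),
    "VERTEBRAL": (300, 80),
}


def sample_streams(test_indices, realizations, stream_size, seed=0):
    """Return one randomly ordered stream of test indices per realization.

    Assumption: indices are drawn without replacement within a realization;
    the quoted text does not say which sampling scheme was used.
    """
    rng = random.Random(seed)
    return [rng.sample(test_indices, stream_size) for _ in range(realizations)]


# Hypothetical test set of 100 examples, used only to exercise the sketch.
realizations, stream_size = SETUP["VERTEBRAL"]
streams = sample_streams(list(range(100)), realizations, stream_size)
```

Each element of `streams` would then be fed, in order, to every algorithm under comparison, so that all methods see the same online data within a realization.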