Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Contextual Active Model Selection
Authors: Xuefeng Liu, Fangfang Xia, Rick Stevens, Yuxin Chen
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate the effectiveness and robustness of our approach on a variety of online model selection tasks spanning different application domains (from generic ML benchmarks such as CIFAR10 to domain-specific tasks in biomedical analysis), data scales (ranging from 80 to 10K), data modalities (i.e., tabular, image, and graph-based data), and label types (binary or multiclass labels). For the tasks evaluated, (1) CAMS outperforms all competing baselines by a significant margin. |
| Researcher Affiliation | Academia | Xuefeng Liu¹, Fangfang Xia², Rick L. Stevens¹,², Yuxin Chen¹; ¹Department of Computer Science, University of Chicago; ²Argonne National Laboratory |
| Pseudocode | Yes | Figure 1: The Contextual Active Model Selection (CAMS) algorithm |
| Open Source Code | Yes | We provide the code and data in the supplementary material with a readme.txt for reproducing the results. Experiment details are listed in Section 6 and Appendix G, D.6. (from NeurIPS Paper Checklist, Section 5) |
| Open Datasets | Yes | Datasets. We evaluate our approach using five datasets: (1) CIFAR10 [41]... (2) DRIFT [73]... (3) VERTEBRAL [5]... (4) HIV [74]... (5) CovType [24]... |
| Dataset Splits | No | The paper mentions training and test sets but does not specify explicit validation set splits (e.g., percentages or counts) or a distinct validation phase with defined splits for hyperparameter tuning in the main experimental setup. It mentions 'randomly selected stream-size aligned data from testing-set' for online streaming. |
| Hardware Specification | Yes | We performed our experiments on a Linux server with 80 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz and total 528 Gigabyte memory. |
| Software Dependencies | No | The paper mentions software like 'VGG', 'ResNet', 'DenseNet', 'scikit-learn built-in models', but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We set 100 realizations and 3000 stream-size for DRIFT, 20 realizations and 10000 stream-size for CIFAR10, 200 realizations and 4000 stream-size for HIV, 300 realizations and 80 stream-size for VERTEBRAL. In each realization, we randomly selected stream-size aligned data from testing-set and make it as online streaming data which is the input of each algorithm. |
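
The Experiment Setup row describes drawing, for each realization, a random stream of fixed size from the test set. A minimal Python sketch of that sampling loop follows, under stated assumptions: the function name `sample_streams`, the `seed` parameter, and sampling without replacement within a realization are illustrative choices, not details confirmed by the paper. Only the realization counts and stream sizes come from the quoted text.

```python
import random

# (realizations, stream size) per dataset, taken from the Experiment Setup row.
SETUP = {
    "DRIFT": (100, 3000),
    "CIFAR10": (20, 10000),
    "HIV": (200, 4000),
    "VERTEBRAL": (300, 80),
}


def sample_streams(test_set_size, dataset, seed=0):
    """Yield one list of test-set indices per realization.

    Hypothetical helper: the paper's own sampling code may differ.
    """
    realizations, stream_size = SETUP[dataset]
    rng = random.Random(seed)
    for _ in range(realizations):
        # Assumption: indices are drawn without replacement within a
        # realization, so test_set_size must be >= stream_size.
        yield rng.sample(range(test_set_size), stream_size)
```

Each yielded index list would then be fed, in order, to every algorithm under comparison, so that all methods see the same stream within a realization. The test-set size is left as a parameter because the summary above does not state it.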