Cross-validation Confidence Intervals for Test Error

Authors: Pierre Bayle, Alexandre Bayle, Lucas Janson, Lester Mackey

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our real-data experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature. |
| Researcher Affiliation | Collaboration | Pierre Bayle (Princeton University, pbayle@princeton.edu); Alexandre Bayle (Harvard University, alexandre_bayle@g.harvard.edu); Lucas Janson (Harvard University, ljanson@fas.harvard.edu); Lester Mackey (Microsoft Research New England, lmackey@microsoft.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Complete experimental details are available in App. K.1, and code replicating all experiments can be found at https://github.com/alexandre-bayle/cvci. |
| Open Datasets | Yes | We use the Higgs dataset of [6, 7] to study the classification error... and the Kaggle Flight Delays dataset of [1] to study the mean-squared regression error... |
| Dataset Splits | Yes | We fix k = 10, use 90-10 train-validation splits for all tests save 5×2-fold CV, and report our results using σ̂²_{n,out} (as σ̂²_{n,in} results are nearly identical). (A cross-validation interval sketch appears after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models or memory used for running experiments. |
| Software Dependencies | No | The paper mentions machine learning libraries such as scikit-learn (in its references) and various algorithms but does not specify exact version numbers for any software dependencies. |
| Experiment Setup | Yes | We fix k = 10, use 90-10 train-validation splits for all tests save 5×2-fold CV... Complete experimental details are available in App. K.1... For random forest, we used RandomForestClassifier and RandomForestRegressor from scikit-learn [44] with max_depth=6 for Higgs and max_depth=10 for Flight Delays, and n_estimators=100. For neural network classification, we used a three-layer neural network with 100 units per layer, ReLU activation, Adam optimizer (with learning rate 10⁻³), and batch size 64. For ℓ2-penalized logistic regression classification and ridge regression, we used LogisticRegression and Ridge from scikit-learn [44], respectively, with penalty parameter α = 1. (A model-configuration sketch appears after this table.) |
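To make the Dataset Splits row concrete, here is a minimal sketch of forming a k-fold cross-validation confidence interval for test error from per-example out-of-fold losses. The plug-in normal interval, the function name `cv_confidence_interval`, and the 0-1 loss are illustrative assumptions and not the authors' exact σ̂²_{n,out} estimator; see the paper and https://github.com/alexandre-bayle/cvci for the actual implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import KFold


def cv_confidence_interval(X, y, model, k=10, alpha=0.05, seed=0):
    """Return (cv_error, lower, upper) for a (1 - alpha) plug-in normal interval."""
    n = len(y)
    losses = np.empty(n)                    # per-example out-of-fold 0-1 losses
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):  # 90-10 splits when k = 10
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        losses[val_idx] = (preds != y[val_idx]).astype(float)
    cv_error = losses.mean()
    se = losses.std(ddof=1) / np.sqrt(n)    # estimated standard error of the CV error
    z = norm.ppf(1 - alpha / 2)
    return cv_error, cv_error - z * se, cv_error + z * se


if __name__ == "__main__":
    # Synthetic stand-in for the Higgs classification task (not the paper's data)
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, random_state=0)
    print(cv_confidence_interval(X, y, LogisticRegression(penalty="l2", max_iter=1000)))
```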
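The Experiment Setup row can likewise be written out as estimator configurations. The random forest and Ridge settings below follow the quoted text directly; the MLPClassifier stand-in for the three-layer neural network and the C = 1/α mapping for the logistic penalty are assumptions, since App. K.1 is not reproduced here and scikit-learn's LogisticRegression is parameterized by the inverse penalty C rather than α.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.neural_network import MLPClassifier

models = {
    # Classification error on Higgs
    "rf_higgs": RandomForestClassifier(max_depth=6, n_estimators=100),
    # Mean-squared regression error on Flight Delays
    "rf_flights": RandomForestRegressor(max_depth=10, n_estimators=100),
    # Three hidden layers of 100 ReLU units, Adam, learning rate 1e-3, batch size 64
    # (MLPClassifier is an illustrative stand-in; the paper does not name the NN library)
    "nn_higgs": MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                              activation="relu", solver="adam",
                              learning_rate_init=1e-3, batch_size=64),
    # l2-penalized logistic regression; C = 1/alpha = 1 is an assumed mapping
    "logreg_higgs": LogisticRegression(penalty="l2", C=1.0),
    # Ridge regression with penalty parameter alpha = 1
    "ridge_flights": Ridge(alpha=1.0),
}
```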