Cross-validation Confidence Intervals for Test Error

Authors: Pierre Bayle, Alexandre Bayle, Lucas Janson, Lester Mackey

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our real-data experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature. |
| Researcher Affiliation | Collaboration | Pierre Bayle (Princeton University, pbayle@princeton.edu); Alexandre Bayle (Harvard University, alexandre_bayle@g.harvard.edu); Lucas Janson (Harvard University, ljanson@fas.harvard.edu); Lester Mackey (Microsoft Research New England, lmackey@microsoft.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Complete experimental details are available in App. K.1, and code replicating all experiments can be found at https://github.com/alexandre-bayle/cvci. |
| Open Datasets | Yes | We use the Higgs dataset of [6, 7] to study the classification error... and the Kaggle Flight Delays dataset of [1] to study the mean-squared regression error... |
| Dataset Splits | Yes | We fix k = 10, use 90-10 train-validation splits for all tests save 5×2-fold CV, and report our results using σ̂²_{n,out} (as σ̂²_{n,in} results are nearly identical). (A cross-validation interval sketch appears after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models or memory used for running experiments. |
| Software Dependencies | No | The paper mentions machine learning libraries such as scikit-learn (in its references) and various algorithms but does not specify exact version numbers for any software dependencies. |
| Experiment Setup | Yes | We fix k = 10, use 90-10 train-validation splits for all tests save 5×2-fold CV... Complete experimental details are available in App. K.1... For random forest, we used RandomForestClassifier and RandomForestRegressor from scikit-learn [44] with max_depth=6 for Higgs and max_depth=10 for Flight Delays, and n_estimators=100. For neural network classification, we used a three-layer neural network with 100 units per layer, ReLU activation, Adam optimizer (with learning rate 10⁻³), and batch size 64. For ℓ2-penalized logistic regression classification and ridge regression, we used LogisticRegression and Ridge from scikit-learn [44], respectively, with penalty parameter α = 1. (A model-configuration sketch appears after this table.) |
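To make the Dataset Splits row concrete, here is a minimal sketch of forming a k-fold cross-validation confidence interval for test error from per-example out-of-fold losses. The plug-in normal interval, the function name `cv_confidence_interval`, and the 0-1 loss are illustrative assumptions and not the authors' exact σ̂²_{n,out} estimator; see the paper and https://github.com/alexandre-bayle/cvci for the actual implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import KFold


def cv_confidence_interval(X, y, model, k=10, alpha=0.05, seed=0):
    """Return (cv_error, lower, upper) for a (1 - alpha) plug-in normal interval."""
    n = len(y)
    losses = np.empty(n)                    # per-example out-of-fold 0-1 losses
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):  # 90-10 splits when k = 10
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        losses[val_idx] = (preds != y[val_idx]).astype(float)
    cv_error = losses.mean()
    se = losses.std(ddof=1) / np.sqrt(n)    # estimated standard error of the CV error
    z = norm.ppf(1 - alpha / 2)
    return cv_error, cv_error - z * se, cv_error + z * se


if __name__ == "__main__":
    # Synthetic stand-in for the Higgs classification task (not the paper's data)
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, random_state=0)
    print(cv_confidence_interval(X, y, LogisticRegression(penalty="l2", max_iter=1000)))
```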
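The Experiment Setup row can likewise be written out as estimator configurations. The random forest and Ridge settings below follow the quoted text directly; the MLPClassifier stand-in for the three-layer neural network and the C = 1/α mapping for the logistic penalty are assumptions, since App. K.1 is not reproduced here and scikit-learn's LogisticRegression is parameterized by the inverse penalty C rather than α.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.neural_network import MLPClassifier

models = {
    # Classification error on Higgs
    "rf_higgs": RandomForestClassifier(max_depth=6, n_estimators=100),
    # Mean-squared regression error on Flight Delays
    "rf_flights": RandomForestRegressor(max_depth=10, n_estimators=100),
    # Three hidden layers of 100 ReLU units, Adam, learning rate 1e-3, batch size 64
    # (MLPClassifier is an illustrative stand-in; the paper does not name the NN library)
    "nn_higgs": MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                              activation="relu", solver="adam",
                              learning_rate_init=1e-3, batch_size=64),
    # l2-penalized logistic regression; C = 1/alpha = 1 is an assumed mapping
    "logreg_higgs": LogisticRegression(penalty="l2", C=1.0),
    # Ridge regression with penalty parameter alpha = 1
    "ridge_flights": Ridge(alpha=1.0),
}
```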