Selective Ensembles for Consistent Predictions

Authors: Emily Black, Klas Leino, Matt Fredrikson

ICLR 2022

For each reproducibility variable below, the assessed result and the supporting LLM response:
Research Type: Experimental
LLM Response: Our experiments show that on seven benchmark datasets, selective ensembles of just ten models either agree on the entire test data across random differences in how their constituent models are trained, or abstain at reasonably low rates (1-5% in most cases; Section 5.1). Additionally, we show that simple ensembling doubles the agreement of attributions on key metrics on average, and that when the variance of the constituent models is high, selective ensembling further enhances this effect (Section 5.2). In summary, our contributions are: (1) we show that beyond predictions, feature attributions are not consistent across seemingly inconsequential random choices during learning (Section 3); (2) we introduce selective ensembling, a learning method that guarantees bounded inconsistency in predictions (Section 4); and (3) we demonstrate the effectiveness of this approach on seven datasets, showing that selective ensembles consistently predict all points across models trained with different random seeds or leave-one-out differences in their training data, while also achieving low abstention rates and higher feature attribution consistency.
Researcher Affiliation: Academia
LLM Response: Emily Black, Klas Leino, Matt Fredrikson ({emilybla, kleino, mfredrik}@cs.cmu.edu), Carnegie Mellon University
Pseudocode: Yes
LLM Response:

Algorithm 1: Selective Ensemble Creation

    def train_ensemble(P, S ∈ S^n, n):
        return {P(S_i) for i ∈ [n]}

    def sample_ensemble(P, S, n):
        S ← sample_iid(S^n)
        return train_ensemble(P, S, n)

Algorithm 2: Selective Ensemble Prediction

    def ensemble_predict(ĝ_n(P,S), α, x):
        Y ← Σ_{h ∈ ĝ_n(P,S)} one_hot(h(x))
        n_A, n_B ← top_2(Y)
        if binom_p_value(n_A, n_A + n_B, 0.5) ≤ α then
            return argmax(Y)
        else
            return ABSTAIN
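As a concrete companion to Algorithm 2, here is a minimal runnable Python sketch; the callable-model interface, the num_classes parameter, and the ABSTAIN sentinel are our assumptions, and scipy.stats.binomtest stands in for binom_p_value:

    import numpy as np
    from scipy.stats import binomtest

    ABSTAIN = -1  # sentinel for abstention; the paper leaves this value abstract

    def ensemble_predict(models, alpha, x, num_classes):
        """Selective-ensemble prediction in the style of Algorithm 2 (sketch).

        models: callables mapping an input to an integer class label.
        Returns the plurality label if a two-sided binomial test between
        the two leading vote counts is significant at level alpha;
        otherwise abstains.
        """
        votes = np.zeros(num_classes, dtype=int)
        for h in models:
            votes[h(x)] += 1                  # tally one-hot votes
        order = np.argsort(votes)[::-1]       # classes by descending vote count
        n_a, n_b = int(votes[order[0]]), int(votes[order[1]])
        # H0: the two leading classes each win a head-to-head vote w.p. 0.5.
        if binomtest(n_a, n_a + n_b, 0.5).pvalue <= alpha:
            return int(order[0])
        return ABSTAIN

For instance, with ten binary-voting models and α = 0.05, only 9-1 and 10-0 vote splits are significant (p ≈ 0.021 and p ≈ 0.002), so an 8-2 split (p ≈ 0.109) abstains; this is the mechanism behind the abstention rates reported above.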
Open Source Code: No
LLM Response: The paper does not provide an explicit statement or link to open-source code for its methodology.
Open Datasets: Yes
LLM Response: Our experiments consider seven datasets: UCI German Credit, Adult, Taiwanese Credit Default, and Seizure, all from Dua and Karra Taniskidou (2017); the IWPC Warfarin Dosing Recommendation (International Warfarin Pharmacogenetic Consortium, 2009); Fashion MNIST (Xiao et al., 2017); and Colorectal Histology (Kather et al., 2016a). ... The UCI datasets as well as FMNIST are under an MIT license; the colorectal histology and Warfarin datasets are under a Creative Commons license (Dua and Karra Taniskidou, 2017; Kather et al., 2016b; International Warfarin Pharmacogenetic Consortium, 2009).
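Of these, only Fashion MNIST ships with Keras; the others must be downloaded from their respective sources. A minimal loading sketch (the [0, 1] pixel scaling mirrors the normalization described below):

    import tensorflow as tf

    # Fashion MNIST is bundled with Keras and downloads on first use.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels into [0, 1]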
Dataset Splits: Yes
LLM Response: We partitioned the data into a training set of 700 and a test set of 200. The Taiwanese credit dataset has 30,000 instances with 24 attributes. We one-hot encode the data to get 32 features and normalize the data to be between zero and one. We partitioned the data into a training set of 22,500 and a test set of 7,500. ... The Adult dataset ... we split into a training set of 14,891, a leave-one-out set of 100, and a test set of 1,501 examples. ... The Seizure dataset ... We split this into 7,950 train points and 3,550 test points. ... Fashion MNIST contains ... 60,000 training examples and 10,000 test examples. ... The colorectal histology dataset ... we divide into a training set of 3,750 and a validation set of 1,250.
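As an illustration of the Taiwanese-credit preprocessing and split described above, a short sketch using pandas and scikit-learn; the file path and label column name are placeholders, not taken from the paper:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("taiwanese_credit.csv")   # placeholder path
    y = df.pop("default")                      # hypothetical label column
    X = pd.get_dummies(df)                     # one-hot encode 24 attributes -> 32 features
    X = MinMaxScaler().fit_transform(X)        # normalize features into [0, 1]

    # 30,000 instances split into 22,500 train / 7,500 test (25% test fraction).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)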
Hardware Specification: Yes
LLM Response: We prepare different models for the same dataset using TensorFlow 2.3.0, and all computations are done using a Titan RTX accelerator on a machine with 64 gigabytes of memory.
Software Dependencies: Yes
LLM Response: All experiments are implemented in TensorFlow 2.3. For each tabular dataset, we train 500 models from independent samples of the relevant source of randomness (e.g., leave-one-out data variations or random seeds), and for each image dataset, we train 200 models from independent samples of each source of randomness. ... We prepare different models for the same dataset using TensorFlow 2.3.0, and all computations are done using a Titan RTX accelerator on a machine with 64 gigabytes of memory.
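The "independent samples of each source of randomness" protocol can be sketched for the random-seed case as below; build_model is a hypothetical constructor (one possibility appears under Experiment Setup), and the data variables are placeholders:

    import tensorflow as tf

    models = []
    for seed in range(500):                    # 500 models per tabular dataset
        tf.random.set_seed(seed)               # vary only the training seed
        model = build_model(num_features)      # hypothetical constructor
        model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
        models.append(model)

For the leave-one-out source of randomness, the loop would instead drop a different training point per iteration while holding the seed fixed.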
Experiment Setup: Yes
LLM Response: The German Credit and Seizure models have three hidden layers, of size 128, 64, and 16. Models on the Adult dataset have one hidden layer of 200 neurons. Models on the Taiwanese dataset have two hidden layers of 32 and 16. The Warfarin models have one hidden layer of 100. The FMNIST model is a modified LeNet architecture (LeCun et al., 1995); this model is trained with dropout. The Colon models use a modified ResNet50 (He et al., 2016), pre-trained on ImageNet (Deng et al., 2009), available from Keras. German Credit, Adult, Seizure, Taiwanese, and Warfarin models are trained for 100 epochs; FMNIST for 50; and Colon models for 20 epochs. German Credit models are trained with a batch size of 32; FMNIST with 64; Adult, Seizure, and Warfarin models with batch sizes of 128; and Colon and Taiwanese Credit models with batch sizes of 512. German Credit, Adult, Seizure, Taiwanese Credit, Warfarin, and Colon models are trained with the Keras Adam optimizer with default parameters. FMNIST models are trained with the Keras SGD optimizer with default parameters.
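For concreteness, a Keras sketch of the German Credit architecture described above (hidden sizes 128/64/16, Adam with defaults, 100 epochs, batch size 32); the ReLU activations, sigmoid output, and binary cross-entropy loss are our assumptions, as the excerpt does not specify them:

    import tensorflow as tf

    def build_model(num_features):
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(num_features,)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),  # assumed binary output
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(),  # Keras defaults
                      loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    # model = build_model(X_train.shape[1])
    # model.fit(X_train, y_train, epochs=100, batch_size=32)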