Joint Training of Deep Ensembles Fails Due to Learner Collusion

Authors: Alan Jeffares, Tennison Liu, Jonathan Crabbé, Mihaela van der Schaar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show that this is for good reason: joint optimization of ensemble loss results in degenerate behavior. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to comprehensively demonstrate the practical implications of this effect on a range of standard machine learning tasks and architectures by smoothly interpolating between independent training and joint optimization. (See the sketches following this table.)
Researcher Affiliation | Academia | Alan Jeffares, University of Cambridge, aj659@cam.ac.uk; Tennison Liu, University of Cambridge, tl522@cam.ac.uk; Jonathan Crabbé, University of Cambridge, jc2133@cam.ac.uk; Mihaela van der Schaar, University of Cambridge, mv472@cam.ac.uk
Pseudocode | No | The paper does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | Yes | Code is provided at https://github.com/alanjeffares/joint-ensembles.
Open Datasets | Yes | We compare the test set performance of independent and joint training on ImageNet [34].
Dataset Splits | Yes | We use the standard ImageNet training-validation splits.
Hardware Specification | Yes | All experiments are run on Azure NCv3-series VMs powered by NVIDIA Tesla V100 GPUs with 16GB of GPU VRAM.
Software Dependencies | No | The paper mentions using implementations from [49] (PyTorch) but does not provide specific version numbers for PyTorch or any other software libraries used.
Experiment Setup | Yes | For all models, we adjusted the following hyperparameters to optimize performance: learning rate, learning rate scheduler (and corresponding decay factor, gamma and frequency), momentum, and weight decay. Batch size was typically set to 128... We train models with early stopping and patience of 15 test set evaluations... We apply stochastic gradient descent as our optimizer.
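
The decomposition mentioned in the Research Type row splits the ensemble objective into base-learner strength and diversity. As an illustration only, the classical ambiguity decomposition for a uniformly averaged squared-error ensemble (Krogh and Vedelsby) has this form; the paper works with a generalised decomposition, so the identity below is a special case rather than the authors' exact objective.

```latex
% Ambiguity decomposition for a uniformly averaged ensemble
% \bar{f}(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x) of M base learners f_1, \dots, f_M:
\bigl(\bar{f}(x) - y\bigr)^2
  = \underbrace{\frac{1}{M}\sum_{m=1}^{M}\bigl(f_m(x) - y\bigr)^2}_{\text{average member loss (strength)}}
  \;-\; \underbrace{\frac{1}{M}\sum_{m=1}^{M}\bigl(f_m(x) - \bar{f}(x)\bigr)^2}_{\text{spread around the ensemble (diversity)}}
```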
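A minimal PyTorch sketch of interpolating between independent and joint training, as described in the Research Type row. The mixing coefficient (called beta here) and the convex-combination form are illustrative assumptions; consult the linked repository for the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def interpolated_ensemble_loss(members, x, y, beta):
    """Convex combination of independent (beta = 0) and joint (beta = 1) training losses.

    members: iterable of nn.Module base learners with a shared output space.
    beta:    mixing coefficient in [0, 1]; the name and parameterisation are
             illustrative assumptions, not necessarily the paper's definition.
    """
    outputs = [m(x) for m in members]                    # per-member logits
    independent = torch.stack(
        [F.cross_entropy(out, y) for out in outputs]
    ).mean()                                             # average of per-member losses
    joint = F.cross_entropy(torch.stack(outputs).mean(dim=0), y)  # loss of the averaged prediction
    return (1 - beta) * independent + beta * joint


# Toy usage with a two-member ensemble of linear classifiers:
members = [torch.nn.Linear(10, 2), torch.nn.Linear(10, 2)]
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = interpolated_ensemble_loss(members, x, y, beta=0.5)
loss.backward()
```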
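The Experiment Setup row translates into standard PyTorch training boilerplate; the sketch below uses SGD with momentum and weight decay, a step learning-rate schedule with a decay factor gamma, batch size 128, and early stopping with a patience of 15 evaluations, as quoted. The toy model, data, and specific hyperparameter values are placeholders rather than the tuned settings from the paper.

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs end to end; the real experiments use the
# ensembles, datasets, and tuned hyperparameters from the paper's repository.
model = nn.Linear(10, 2)
data = [(torch.randn(128, 10), torch.randint(0, 2, (128,))) for _ in range(4)]  # batch size 128

# Stochastic gradient descent with momentum and weight decay (placeholder values).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# One possible scheduler: decay the learning rate by gamma at a fixed step frequency.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

best_loss, patience, bad_evals = float("inf"), 15, 0
for epoch in range(200):  # placeholder epoch budget
    for x, y in data:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
    eval_loss = float(loss)  # stand-in for a proper test-set evaluation
    if eval_loss < best_loss:
        best_loss, bad_evals = eval_loss, 0
    else:
        bad_evals += 1
        if bad_evals >= patience:  # early stopping after 15 non-improving evaluations
            break
```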