Joint Training of Deep Ensembles Fails Due to Learner Collusion

Authors: Alan Jeffares, Tennison Liu, Jonathan Crabbé, Mihaela van der Schaar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show that this is for good reason: joint optimization of ensemble loss results in degenerate behavior. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to comprehensively demonstrate the practical implications of this effect on a range of standard machine learning tasks and architectures by smoothly interpolating between independent training and joint optimization. (See the sketches following this table.)
Researcher Affiliation | Academia | Alan Jeffares, University of Cambridge, aj659@cam.ac.uk; Tennison Liu, University of Cambridge, tl522@cam.ac.uk; Jonathan Crabbé, University of Cambridge, jc2133@cam.ac.uk; Mihaela van der Schaar, University of Cambridge, mv472@cam.ac.uk
Pseudocode | No | The paper does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | Yes | Code is provided at https://github.com/alanjeffares/joint-ensembles.
Open Datasets | Yes | We compare the test set performance of independent and joint training on ImageNet [34].
Dataset Splits | Yes | We use the standard ImageNet training-validation splits.
Hardware Specification | Yes | All experiments are run on Azure NCv3-series VMs powered by NVIDIA Tesla V100 GPUs with 16GB of GPU VRAM.
Software Dependencies | No | The paper mentions using implementations from [49] (PyTorch) but does not provide specific version numbers for PyTorch or any other software libraries used.
Experiment Setup | Yes | For all models, we adjusted the following hyperparameters to optimize performance: learning rate, learning rate scheduler (and corresponding decay factor, gamma and frequency), momentum, and weight decay. Batch size was typically set to 128... We train models with early stopping and patience of 15 test set evaluations... We apply stochastic gradient descent as our optimizer.
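
The decomposition mentioned in the Research Type row splits the ensemble objective into base-learner strength and diversity. As an illustration only, the classical ambiguity decomposition for a uniformly averaged squared-error ensemble (Krogh and Vedelsby) has this form; the paper works with a generalised decomposition, so the identity below is a special case rather than the authors' exact objective.

```latex
% Ambiguity decomposition for a uniformly averaged ensemble
% \bar{f}(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x) of M base learners f_1, \dots, f_M:
\bigl(\bar{f}(x) - y\bigr)^2
  = \underbrace{\frac{1}{M}\sum_{m=1}^{M}\bigl(f_m(x) - y\bigr)^2}_{\text{average member loss (strength)}}
  \;-\; \underbrace{\frac{1}{M}\sum_{m=1}^{M}\bigl(f_m(x) - \bar{f}(x)\bigr)^2}_{\text{spread around the ensemble (diversity)}}
```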
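A minimal PyTorch sketch of interpolating between independent and joint training, as described in the Research Type row. The mixing coefficient (called beta here) and the convex-combination form are illustrative assumptions; consult the linked repository for the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def interpolated_ensemble_loss(members, x, y, beta):
    """Convex combination of independent (beta = 0) and joint (beta = 1) training losses.

    members: iterable of nn.Module base learners with a shared output space.
    beta:    mixing coefficient in [0, 1]; the name and parameterisation are
             illustrative assumptions, not necessarily the paper's definition.
    """
    outputs = [m(x) for m in members]                    # per-member logits
    independent = torch.stack(
        [F.cross_entropy(out, y) for out in outputs]
    ).mean()                                             # average of per-member losses
    joint = F.cross_entropy(torch.stack(outputs).mean(dim=0), y)  # loss of the averaged prediction
    return (1 - beta) * independent + beta * joint


# Toy usage with a two-member ensemble of linear classifiers:
members = [torch.nn.Linear(10, 2), torch.nn.Linear(10, 2)]
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = interpolated_ensemble_loss(members, x, y, beta=0.5)
loss.backward()
```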
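The Experiment Setup row translates into standard PyTorch training boilerplate; the sketch below uses SGD with momentum and weight decay, a step learning-rate schedule with a decay factor gamma, batch size 128, and early stopping with a patience of 15 evaluations, as quoted. The toy model, data, and specific hyperparameter values are placeholders rather than the tuned settings from the paper.

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs end to end; the real experiments use the
# ensembles, datasets, and tuned hyperparameters from the paper's repository.
model = nn.Linear(10, 2)
data = [(torch.randn(128, 10), torch.randint(0, 2, (128,))) for _ in range(4)]  # batch size 128

# Stochastic gradient descent with momentum and weight decay (placeholder values).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# One possible scheduler: decay the learning rate by gamma at a fixed step frequency.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

best_loss, patience, bad_evals = float("inf"), 15, 0
for epoch in range(200):  # placeholder epoch budget
    for x, y in data:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
    eval_loss = float(loss)  # stand-in for a proper test-set evaluation
    if eval_loss < best_loss:
        best_loss, bad_evals = eval_loss, 0
    else:
        bad_evals += 1
        if bad_evals >= patience:  # early stopping after 15 non-improving evaluations
            break
```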