Joint Training of Deep Ensembles Fails Due to Learner Collusion
Authors: Alan Jeffares, Tennison Liu, Jonathan Crabbé, Mihaela van der Schaar
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show that this is for good reason: joint optimization of the ensemble loss results in degenerate behavior. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to comprehensively demonstrate the practical implications of this effect on a range of standard machine learning tasks and architectures by smoothly interpolating between independent training and joint optimization. (An interpolation sketch follows the table.) |
| Researcher Affiliation | Academia | Alan Jeffares, University of Cambridge, aj659@cam.ac.uk; Tennison Liu, University of Cambridge, tl522@cam.ac.uk; Jonathan Crabbé, University of Cambridge, jc2133@cam.ac.uk; Mihaela van der Schaar, University of Cambridge, mv472@cam.ac.uk |
| Pseudocode | No | The paper does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks with structured steps. |
| Open Source Code | Yes | Code is provided at https://github.com/alanjeffares/joint-ensembles. |
| Open Datasets | Yes | We compare the test set performance of independent and joint training on ImageNet [34]. |
| Dataset Splits | Yes | We use the standard ImageNet training-validation splits. |
| Hardware Specification | Yes | All experiments are run on Azure NCv3-series VMs powered by NVIDIA Tesla V100 GPUs with 16GB of GPU VRAM. |
| Software Dependencies | No | The paper mentions using implementations from [49] (PyTorch) but does not provide specific version numbers for PyTorch or any other software libraries used. |
| Experiment Setup | Yes | For all models, we adjusted the following hyperparameters to optimize performance: learning rate, learning rate scheduler (and its corresponding decay factor gamma and frequency), momentum, and weight decay. Batch size was typically set to 128... We train models with early stopping and a patience of 15 test set evaluations... We apply stochastic gradient descent as our optimizer. (A configuration sketch follows the table.) |
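
The abstract quoted in the Research Type row describes decomposing the ensemble objective into the strength of the base learners and the diversity between them, and smoothly interpolating between independent training and joint optimization. The sketch below shows one way such an interpolation can be written for squared error using the classic ambiguity decomposition; it is a minimal illustration, not the authors' implementation, and the function name, the interpolation weight `beta`, and the tensor shapes are all assumptions.

```python
import torch


def interpolated_ensemble_loss(members, x, y, beta=0.5):
    """Squared-error loss interpolating between independent training
    (beta = 0: each member fits its own error) and joint optimization of
    the ensemble output (beta = 1), via the ambiguity decomposition:
        ensemble MSE = mean member MSE - diversity.
    `members`, `beta`, and the shapes below are illustrative assumptions.
    """
    preds = torch.stack([m(x) for m in members])          # (M, batch, 1)
    ens = preds.mean(dim=0)                               # ensemble prediction, (batch, 1)
    member_err = ((preds - y.unsqueeze(0)) ** 2).mean()   # average member squared error
    diversity = ((preds - ens.unsqueeze(0)) ** 2).mean()  # spread of members around the mean
    return member_err - beta * diversity


# Usage sketch on hypothetical toy data.
members = [torch.nn.Linear(10, 1) for _ in range(4)]
opt = torch.optim.SGD([p for m in members for p in m.parameters()], lr=0.01)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = interpolated_ensemble_loss(members, x, y, beta=0.5)
opt.zero_grad()
loss.backward()
opt.step()
```

At `beta = 1` this quantity equals the mean-squared error of the ensemble prediction, so gradients treat the ensemble as a single jointly trained model; at `beta = 0` the diversity term is dropped and each member is trained independently on its own error.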
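The Experiment Setup row quotes SGD with a tuned learning rate, a learning rate scheduler (decay factor gamma and frequency), momentum, weight decay, a typical batch size of 128, and early stopping with a patience of 15 test-set evaluations. The configuration sketch below assembles these pieces in PyTorch; only the items quoted above come from the paper, and every numeric default is a placeholder rather than an author-tuned value.

```python
import torch

# Placeholder hyperparameter values; only the structure (SGD + step scheduler),
# the batch size of 128, and the patience of 15 are taken from the paper.

def make_optimizer_and_scheduler(model, lr=0.1, momentum=0.9,
                                 weight_decay=5e-4, step_size=30, gamma=0.1):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=momentum, weight_decay=weight_decay)
    # Decay the learning rate by `gamma` every `step_size` epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=step_size, gamma=gamma)
    return optimizer, scheduler


class EarlyStopping:
    """Stop when the monitored test-set metric has not improved for
    `patience` consecutive evaluations (15 in the paper)."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, metric):
        if metric < self.best:
            self.best, self.bad_evals = metric, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience  # True -> stop training


# Data would be batched with the quoted batch size, e.g.:
# loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
```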