BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning

Authors: Yeming Wen, Dustin Tran, Jimmy Ba

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across CIFAR-10, CIFAR-100, WMT14 EN-DE/EN-FR translation, and out-of-distribution tasks, BatchEnsemble yields competitive accuracy and uncertainties as typical ensembles; the speedup at test time is 3X and memory reduction is 3X for an ensemble of size 4. Empirically, we show that BatchEnsemble has the best trade-off among accuracy, running time, and memory on several deep learning architectures and learning tasks: CIFAR-10/100 classification with ResNet32 (He et al., 2016) and WMT14 EN-DE/EN-FR machine translation with Transformer (Vaswani et al., 2017). (See the BatchEnsemble forward-pass sketch below the table.)
Researcher Affiliation | Collaboration | Yeming Wen (1,2,3), Dustin Tran (3) & Jimmy Ba (1,2); 1 University of Toronto, 2 Vector Institute, 3 Google Brain
Pseudocode | No | The paper describes its methods using text and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | A footnote on page 1 mentions 'https://github.com/google/edward2', but it does not explicitly state that the code for the methodology described in this paper is available there. It appears to be a general-purpose library.
Open Datasets | Yes | CIFAR: We consider two CIFAR datasets, CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). WMT: In machine translation tasks, we consider the standard training datasets WMT16 English-German and WMT14 English-French. ... (Vaswani et al., 2017). Split-CIFAR100 proposed in Rebuffi et al. (2016)... Split-ImageNet: The dataset has the same set of images as the ImageNet dataset (Deng et al., 2009).
Dataset Splits | Yes | Newstest2013 and Newstest2014 are used as the validation set and test set, respectively. We consider T = 20 tasks on Split-CIFAR100, following the setup of Lopez-Paz & Ranzato (2017). Split-CIFAR100: It randomly splits the entire dataset into T tasks so each task consists of 100/T classes of images. (See the Split-CIFAR100 task-split sketch below the table.)
Hardware Specification | Yes | Experiments are run on 4 NVIDIA P100 GPUs.
Software Dependencies | No | The paper mentions deep learning architectures such as ResNet32 and Transformer, and the Edward2 library, but it does not specify version numbers for any software dependencies, such as deep learning frameworks or libraries.
Experiment Setup | Yes | The Transformer base is trained for 100K steps and the Transformer big is trained for 180K steps. We train the model with mini-batch size 128. The learning rate decreases from 0.1 to 0.01 at the halfway point of training and from 0.01 to 0.001 at 75% of training. The weight decay coefficient is set to 10^-4. (See the learning-rate schedule sketch below the table.)
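
For context on the method the review refers to: BatchEnsemble builds each ensemble member's weight matrix as the elementwise product of a shared "slow" weight W and a rank-one "fast" weight formed from two per-member vectors, so one forward pass over a tiled mini-batch evaluates all members at once. The snippet below is a minimal NumPy sketch of that vectorized forward pass under our reading of the paper, not the authors' Edward2 implementation; all function and variable names are ours.

```python
import numpy as np

def batch_ensemble_dense(x, W, b, r, s):
    """Vectorized BatchEnsemble dense layer (sketch).

    x: (ensemble_size * sub_batch, d_in)  mini-batch, rows grouped by member
    W: (d_in, d_out)                      shared "slow" weight
    b: (ensemble_size, d_out)             per-member bias
    r: (ensemble_size, d_out)             fast weight scaling the outputs
    s: (ensemble_size, d_in)              fast weight scaling the inputs

    Member i effectively uses W_i = W * (s_i r_i^T), but the per-member
    matrices are never materialized: only elementwise products are added
    to a single shared matrix multiply.
    """
    ensemble_size, d_in = s.shape
    sub_batch = x.shape[0] // ensemble_size
    # Tile the per-member vectors so each example is scaled by its member's fast weights.
    S = np.repeat(s, sub_batch, axis=0)   # (batch, d_in)
    R = np.repeat(r, sub_batch, axis=0)   # (batch, d_out)
    B = np.repeat(b, sub_batch, axis=0)   # (batch, d_out)
    return ((x * S) @ W) * R + B

# Tiny usage example: 4 members sharing one 8x16 slow weight.
rng = np.random.default_rng(0)
x = rng.normal(size=(4 * 2, 8))           # 4 members, sub-batch of 2 each
out = batch_ensemble_dense(
    x,
    W=rng.normal(size=(8, 16)),
    b=np.zeros((4, 16)),
    r=rng.normal(size=(4, 16)),
    s=rng.normal(size=(4, 8)),
)
print(out.shape)  # (8, 16)
```

Because only W is shared and each member adds just two vectors and a bias, memory grows marginally with ensemble size, which is the source of the reported 3X memory reduction at ensemble size 4.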
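The Dataset Splits row describes Split-CIFAR100 as a random partition of the 100 classes into T tasks of 100/T classes each (T = 20 in the continual-learning experiments). A small sketch of that partitioning over class labels, assuming a uniform random split; the helper name and seed are ours.

```python
import numpy as np

def split_classes_into_tasks(num_classes=100, num_tasks=20, seed=0):
    """Randomly partition class labels into equally sized tasks,
    as described for Split-CIFAR100 (100/T classes per task)."""
    rng = np.random.default_rng(seed)
    classes = rng.permutation(num_classes)
    return np.split(classes, num_tasks)   # list of num_tasks arrays of size num_classes // num_tasks

tasks = split_classes_into_tasks()
print(len(tasks), tasks[0])   # 20 tasks, 5 class labels in the first task
```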
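The Experiment Setup row quotes a stepwise learning-rate schedule: 0.1 until halfway through training, 0.01 until 75%, then 0.001, with weight decay 10^-4 and mini-batch size 128. A minimal sketch of such a step schedule follows; the function and the 100k-step horizon in the usage example are our own illustration, not the paper's code.

```python
def step_learning_rate(step, total_steps, base_lr=0.1):
    """Stepwise schedule quoted in the Experiment Setup row:
    base_lr until 50% of training, base_lr/10 until 75%, then base_lr/100."""
    if step < 0.5 * total_steps:
        return base_lr           # 0.1
    elif step < 0.75 * total_steps:
        return base_lr / 10      # 0.01
    else:
        return base_lr / 100     # 0.001

# Example: the schedule at a few points of an illustrative 100k-step run.
for step in (0, 50_000, 75_000, 99_999):
    print(step, step_learning_rate(step, total_steps=100_000))
```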