Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches

Authors: Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, flipout achieves the ideal linear variance reduction for fully connected networks, convolutional networks, and RNNs. We find significant speedups in training neural networks with multiplicative Gaussian perturbations. We show that flipout is effective at regularizing LSTMs, and outperforms previous methods. Flipout also enables us to vectorize evolution strategies: in our experiments, a single GPU with flipout can handle the same throughput as at least 40 CPU cores using existing methods, equivalent to a factor-of-4 cost reduction on Amazon Web Services." (Section 4: Experiments)
Researcher Affiliation | Collaboration | Yeming Wen, Paul Vicol, Jimmy Ba (University of Toronto, Vector Institute; wenyemin, pvicol, jba@cs.toronto.edu); Dustin Tran (Columbia University, Google; trandustin@google.com); Roger Grosse (University of Toronto, Vector Institute; rgrosse@cs.toronto.ca)
Pseudocode | No | No explicit pseudocode or algorithm blocks were found. (A hedged sketch of the flipout estimator is given below the table.)
Open Source Code | No | The paper does not provide any statement about making the source code publicly available, nor a link to a repository.
Open Datasets | Yes | Table 1 (Network Configurations):
    Name | Network Type | Data Set
    ConvLe (Shallow) | Convolutional | MNIST (LeCun et al., 1998)
    ConVGG (Deep) | Convolutional | CIFAR-10 (Krizhevsky & Hinton, 2009)
    FC | Fully Connected | MNIST
    LSTM | LSTM Network | Penn Treebank (Marcus et al., 1993)
Dataset Splits | Yes | "We perform early stopping based on validation performance. Here, we applied flipout to the hidden-to-hidden weight matrix. More hyperparameter details are given in Appendix D. The results, measured in bits-per-character (BPC) for the validation and test sequences of PTB, are shown in Table 2."
Hardware Specification | No | The paper mentions 'a single GPU' and 'multi-core CPU machines' but does not provide specific hardware models (e.g., GPU model, CPU type/model, or specific TPU version).
Software Dependencies | No | The paper mentions using 'Adam (Reddi et al., 2018)' but does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | "We trained each model on non-overlapping sequences of 100 characters in batches of size 32, using the AMSGrad variant of Adam (Reddi et al., 2018) with learning rate 0.002. We perform early stopping based on validation performance. Here, we applied flipout to the hidden-to-hidden weight matrix. More hyperparameter details are given in Appendix D. The results, measured in bits-per-character (BPC) for the validation and test sequences of PTB, are shown in Table 2. For our word-level experiments, we used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. We trained on sequences of length 35 in batches of size 40, for 100 epochs. We used SGD with initial learning rate 30, and decayed the learning rate by a factor of 4 based on the nonmonotonic criterion introduced by Merity et al. (2017). We used flipout to implement DropConnect, as described in Section 2.1, and call this WD+Flipout. We applied WD+Flipout to the hidden-to-hidden weight matrices for recurrent regularization, and used the same hyperparameters as Merity et al. (2017). We used embedding dropout (setting rows of the embedding matrix to 0) with probability 0.1 for all regularized models except Gal, where we used probability 0.2 as specified in their paper. More hyperparameter details are given in Appendix D." (These quoted hyperparameters are collected into a configuration sketch after the table.)
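
For orientation, the flipout estimator the paper evaluates can be written in a few lines for the fully connected case. The sketch below is our own illustration, not code from the authors: the function name flipout_linear and the argument names (w_mean for the weight mean, w_perturb for the single sampled perturbation shared across the mini-batch) are ours, and NumPy is used only for convenience.

    import numpy as np

    def flipout_linear(x, w_mean, w_perturb, rng):
        """Forward pass of a fully connected layer with flipout perturbations.

        x:         [batch, n_in]  mini-batch of inputs
        w_mean:    [n_out, n_in]  mean weights W_bar
        w_perturb: [n_out, n_in]  one sampled perturbation delta-W-hat,
                                  shared by the whole mini-batch
        """
        batch, n_in = x.shape
        n_out = w_mean.shape[0]

        # Independent random sign vectors for each example: s_n flips the
        # input dimensions, r_n flips the output dimensions.
        s = rng.choice([-1.0, 1.0], size=(batch, n_in))
        r = rng.choice([-1.0, 1.0], size=(batch, n_out))

        # y_n = W_bar x_n + (delta-W-hat (x_n * s_n)) * r_n, which gives each
        # example a pseudo-independent perturbation at the cost of two extra
        # elementwise multiplies and one extra matrix multiply.
        return x @ w_mean.T + ((x * s) @ w_perturb.T) * r

    # Example usage (shapes only; values are arbitrary):
    # rng = np.random.default_rng(0)
    # x = rng.standard_normal((32, 784))
    # w_mean = rng.standard_normal((256, 784))
    # w_perturb = 0.1 * rng.standard_normal((256, 784))  # e.g. Gaussian noise
    # y = flipout_linear(x, w_mean, w_perturb, rng)       # shape (32, 256)

In our reading of Section 2.1, DropConnect with drop probability 1/2 fits the same scheme with w_mean = w_perturb = W/2, which is how the WD+Flipout regularizer quoted in the Experiment Setup row reuses this trick.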
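
The LSTM training details quoted in the Experiment Setup row can also be collected into a small configuration summary. The dictionaries below are our own restatement of the quoted hyperparameters; the key names are ours and this is not a configuration file released with the paper.

    # Character-level Penn Treebank LSTM (values from the quoted setup).
    char_level_ptb = {
        "sequence_length": 100,      # non-overlapping character sequences
        "batch_size": 32,
        "optimizer": "AMSGrad variant of Adam (Reddi et al., 2018)",
        "learning_rate": 0.002,
        "early_stopping": "validation performance",
        "flipout_applied_to": "hidden-to-hidden weight matrix",
    }

    # Word-level Penn Treebank LSTM (values from the quoted setup).
    word_level_ptb = {
        "model": "2-layer LSTM, 650 hidden units per layer",
        "embedding_dim": 650,
        "sequence_length": 35,
        "batch_size": 40,
        "epochs": 100,
        "optimizer": "SGD",
        "initial_learning_rate": 30,
        "lr_decay_factor": 4,        # nonmonotonic criterion of Merity et al. (2017)
        "recurrent_regularizer": "WD+Flipout on hidden-to-hidden weights",
        "embedding_dropout": 0.1,    # 0.2 for the Gal baseline, per their paper
    }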