Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches

Authors: Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, flipout achieves the ideal linear variance reduction for fully connected networks, convolutional networks, and RNNs. We find significant speedups in training neural networks with multiplicative Gaussian perturbations. We show that flipout is effective at regularizing LSTMs, and outperforms previous methods. Flipout also enables us to vectorize evolution strategies: in our experiments, a single GPU with flipout can handle the same throughput as at least 40 CPU cores using existing methods, equivalent to a factor-of-4 cost reduction on Amazon Web Services." (Section 4: Experiments)
Researcher Affiliation | Collaboration | Yeming Wen, Paul Vicol, Jimmy Ba (University of Toronto, Vector Institute; wenyemin, pvicol, jba@cs.toronto.edu); Dustin Tran (Columbia University, Google; trandustin@google.com); Roger Grosse (University of Toronto, Vector Institute; rgrosse@cs.toronto.ca)
Pseudocode | No | No explicit pseudocode or algorithm blocks were found. (A hedged sketch of the flipout estimator is given below the table.)
Open Source Code | No | The paper does not provide any statement about making the source code publicly available, nor a link to a repository.
Open Datasets | Yes | Table 1 (Network Configurations):
    Name | Network Type | Data Set
    ConvLe (Shallow) | Convolutional | MNIST (LeCun et al., 1998)
    ConVGG (Deep) | Convolutional | CIFAR-10 (Krizhevsky & Hinton, 2009)
    FC | Fully Connected | MNIST
    LSTM | LSTM Network | Penn Treebank (Marcus et al., 1993)
Dataset Splits | Yes | "We perform early stopping based on validation performance. Here, we applied flipout to the hidden-to-hidden weight matrix. More hyperparameter details are given in Appendix D. The results, measured in bits-per-character (BPC) for the validation and test sequences of PTB, are shown in Table 2."
Hardware Specification | No | The paper mentions 'a single GPU' and 'multi-core CPU machines' but does not provide specific hardware models (e.g., GPU model, CPU type/model, or specific TPU version).
Software Dependencies | No | The paper mentions using 'Adam (Reddi et al., 2018)' but does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | "We trained each model on non-overlapping sequences of 100 characters in batches of size 32, using the AMSGrad variant of Adam (Reddi et al., 2018) with learning rate 0.002. We perform early stopping based on validation performance. Here, we applied flipout to the hidden-to-hidden weight matrix. More hyperparameter details are given in Appendix D. The results, measured in bits-per-character (BPC) for the validation and test sequences of PTB, are shown in Table 2. For our word-level experiments, we used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. We trained on sequences of length 35 in batches of size 40, for 100 epochs. We used SGD with initial learning rate 30, and decayed the learning rate by a factor of 4 based on the nonmonotonic criterion introduced by Merity et al. (2017). We used flipout to implement DropConnect, as described in Section 2.1, and call this WD+Flipout. We applied WD+Flipout to the hidden-to-hidden weight matrices for recurrent regularization, and used the same hyperparameters as Merity et al. (2017). We used embedding dropout (setting rows of the embedding matrix to 0) with probability 0.1 for all regularized models except Gal, where we used probability 0.2 as specified in their paper. More hyperparameter details are given in Appendix D." (These quoted hyperparameters are collected into a configuration sketch after the table.)
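
For orientation, the flipout estimator the paper evaluates can be written in a few lines for the fully connected case. The sketch below is our own illustration, not code from the authors: the function name flipout_linear and the argument names (w_mean for the weight mean, w_perturb for the single sampled perturbation shared across the mini-batch) are ours, and NumPy is used only for convenience.

    import numpy as np

    def flipout_linear(x, w_mean, w_perturb, rng):
        """Forward pass of a fully connected layer with flipout perturbations.

        x:         [batch, n_in]  mini-batch of inputs
        w_mean:    [n_out, n_in]  mean weights W_bar
        w_perturb: [n_out, n_in]  one sampled perturbation delta-W-hat,
                                  shared by the whole mini-batch
        """
        batch, n_in = x.shape
        n_out = w_mean.shape[0]

        # Independent random sign vectors for each example: s_n flips the
        # input dimensions, r_n flips the output dimensions.
        s = rng.choice([-1.0, 1.0], size=(batch, n_in))
        r = rng.choice([-1.0, 1.0], size=(batch, n_out))

        # y_n = W_bar x_n + (delta-W-hat (x_n * s_n)) * r_n, which gives each
        # example a pseudo-independent perturbation at the cost of two extra
        # elementwise multiplies and one extra matrix multiply.
        return x @ w_mean.T + ((x * s) @ w_perturb.T) * r

    # Example usage (shapes only; values are arbitrary):
    # rng = np.random.default_rng(0)
    # x = rng.standard_normal((32, 784))
    # w_mean = rng.standard_normal((256, 784))
    # w_perturb = 0.1 * rng.standard_normal((256, 784))  # e.g. Gaussian noise
    # y = flipout_linear(x, w_mean, w_perturb, rng)       # shape (32, 256)

In our reading of Section 2.1, DropConnect with drop probability 1/2 fits the same scheme with w_mean = w_perturb = W/2, which is how the WD+Flipout regularizer quoted in the Experiment Setup row reuses this trick.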
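
The LSTM training details quoted in the Experiment Setup row can also be collected into a small configuration summary. The dictionaries below are our own restatement of the quoted hyperparameters; the key names are ours and this is not a configuration file released with the paper.

    # Character-level Penn Treebank LSTM (values from the quoted setup).
    char_level_ptb = {
        "sequence_length": 100,      # non-overlapping character sequences
        "batch_size": 32,
        "optimizer": "AMSGrad variant of Adam (Reddi et al., 2018)",
        "learning_rate": 0.002,
        "early_stopping": "validation performance",
        "flipout_applied_to": "hidden-to-hidden weight matrix",
    }

    # Word-level Penn Treebank LSTM (values from the quoted setup).
    word_level_ptb = {
        "model": "2-layer LSTM, 650 hidden units per layer",
        "embedding_dim": 650,
        "sequence_length": 35,
        "batch_size": 40,
        "epochs": 100,
        "optimizer": "SGD",
        "initial_learning_rate": 30,
        "lr_decay_factor": 4,        # nonmonotonic criterion of Merity et al. (2017)
        "recurrent_regularizer": "WD+Flipout on hidden-to-hidden weights",
        "embedding_dropout": 0.1,    # 0.2 for the Gal baseline, per their paper
    }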