Fraternal Dropout
Authors: Konrad Zolna, Devansh Arpit, Dendi Suhubdy, Yoshua Bengio
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model and achieve state-of-the-art results in sequence modeling tasks on two benchmark datasets Penn Treebank and Wikitext-2. We also show that our approach leads to performance improvement by a significant margin in image captioning (Microsoft COCO) and semi-supervised (CIFAR-10) tasks. |
| Researcher Affiliation | Academia | Konrad Zołna (1,2), Devansh Arpit (2), Dendi Suhubdy (2) & Yoshua Bengio (2,3); 1: Jagiellonian University, 2: MILA, Université de Montréal, 3: CIFAR Senior Fellow |
| Pseudocode | No | The paper describes the proposed method using mathematical equations and textual descriptions, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code is available at github.com/kondiz/fraternal-dropout. |
| Open Datasets | Yes | In the case of language modeling we test our model on two benchmark datasets: the Penn Treebank (PTB) dataset (Marcus et al., 1993) and the WikiText-2 (WT2) dataset (Merity et al., 2016). We also apply fraternal dropout on an image captioning task. We use the well-known show and tell model as a baseline (Vinyals et al., 2014)... We use the CIFAR-10 dataset that consists of 32×32 images from 10 classes. |
| Dataset Splits | Yes | Following the usual splits used in semi-supervised learning literature, we use 4 thousand labeled and 41 thousand unlabeled samples for training, 5 thousand labeled samples for validation and 10 thousand labeled samples for the test set. (A sketch of these splits appears after this table.) |
| Hardware Specification | No | The paper mentions fitting models on a 'single GPU' and references 'lack of computational power' but does not specify any particular GPU model, CPU, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' in a footnote, but without a specific version number. It does not list other software dependencies with version numbers. |
| Experiment Setup | Yes | Hence, in our experiments we leave a vast majority of hyper-parameters used in the baseline model (Melis et al., 2017) unchanged, i.e. embedding and hidden state sizes, gradient clipping value, weight decay and the values used for all dropout layers (dropout on the word vectors, the output between LSTM layers, the output of the final LSTM, and embedding dropout)... We use a batch size of 64, truncated back-propagation with 35 time steps, a constant zero state is provided as the initial state with probability 0.01 (similar to Melis et al. (2017)), SGD with learning rate 30 (no momentum) which is multiplied by 0.1 whenever validation performance does not improve for 20 epochs, weight dropout on the hidden-to-hidden matrix 0.5, dropout every word in a mini-batch with probability 0.1, embedding dropout 0.65, output dropout 0.4 (final value of LSTM), gradient clipping of 0.25, weight decay 1.2 × 10⁻⁶, input embedding size of 655, the input/output size of the LSTM is the same as the embedding size (655) and the embedding weights are tied (Inan et al., 2016; Press & Wolf, 2016). (A sketch collecting these hyper-parameters appears after this table.) |
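The dataset-split row above fully specifies the CIFAR-10 semi-supervised splits (4k labeled / 41k unlabeled / 5k validation, plus the standard 10k test set). The snippet below is a minimal sketch of how such a split could be built with torchvision; the use of a random permutation, the index boundaries, and the fixed seed are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from torchvision.datasets import CIFAR10

# Standard CIFAR-10: 50,000 training images and 10,000 test images.
train_set = CIFAR10(root="./data", train=True, download=True)
test_set = CIFAR10(root="./data", train=False, download=True)   # used as-is as the 10k test set

rng = np.random.RandomState(0)             # assumed seed, not from the paper
perm = rng.permutation(len(train_set))     # shuffle the 50k training indices

labeled_idx = perm[:4_000]                 # 4k labeled training samples
unlabeled_idx = perm[4_000:45_000]         # 41k unlabeled training samples (labels discarded)
valid_idx = perm[45_000:]                  # 5k labeled validation samples
```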
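The experiment-setup row above lists the language-modeling hyper-parameters in prose. The dict below simply collects the quoted values in one place as a sketch; the key names and the dict itself are illustrative assumptions, not the authors' configuration object, and only the numeric values come from the quoted text (most are inherited from the Melis et al., 2017 baseline).

```python
# Hypothetical grouping of the PTB language-model hyper-parameters quoted above.
ptb_hparams = {
    "batch_size": 64,
    "bptt": 35,                  # truncated back-propagation length (time steps)
    "zero_state_prob": 0.01,     # probability of a constant zero initial state
    "optimizer": "SGD",
    "lr": 30.0,                  # no momentum
    "lr_decay": 0.1,             # multiply lr by 0.1 when validation stalls
    "lr_patience_epochs": 20,    # ...for 20 epochs
    "weight_dropout": 0.5,       # dropout on the hidden-to-hidden matrix
    "word_dropout": 0.1,         # drop every word in a mini-batch with this probability
    "embedding_dropout": 0.65,
    "output_dropout": 0.4,       # dropout on the final LSTM output
    "grad_clip": 0.25,
    "weight_decay": 1.2e-6,
    "embedding_size": 655,       # LSTM input/output size equals the embedding size
    "tie_weights": True,         # embedding and softmax weights are tied
}
```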