Fraternal Dropout
Authors: Konrad Zolna, Devansh Arpit, Dendi Suhubdy, Yoshua Bengio
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model and achieve state-of-the-art results in sequence modeling tasks on two benchmark datasets Penn Treebank and Wikitext-2. We also show that our approach leads to performance improvement by a significant margin in image captioning (Microsoft COCO) and semi-supervised (CIFAR-10) tasks. |
| Researcher Affiliation | Academia | Konrad Zołna (1,2), Devansh Arpit (2), Dendi Suhubdy (2) & Yoshua Bengio (2,3); 1: Jagiellonian University, 2: MILA, Université de Montréal, 3: CIFAR Senior Fellow |
| Pseudocode | No | The paper describes the proposed method using mathematical equations and textual descriptions, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code is available at github.com/kondiz/fraternal-dropout. |
| Open Datasets | Yes | In the case of language modeling we test our model on two benchmark datasets: the Penn Treebank (PTB) dataset (Marcus et al., 1993) and the WikiText-2 (WT2) dataset (Merity et al., 2016). We also apply fraternal dropout on an image captioning task. We use the well-known show and tell model as a baseline (Vinyals et al., 2014)... We use the CIFAR-10 dataset that consists of 32×32 images from 10 classes. |
| Dataset Splits | Yes | Following the usual splits used in semi-supervised learning literature, we use 4 thousand labeled and 41 thousand unlabeled samples for training, 5 thousand labeled samples for validation and 10 thousand labeled samples for the test set. (A sketch of these splits appears after this table.) |
| Hardware Specification | No | The paper mentions fitting models on a 'single GPU' and references 'lack of computational power' but does not specify any particular GPU model, CPU, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' in a footnote, but without a specific version number. It does not list other software dependencies with version numbers. |
| Experiment Setup | Yes | Hence, in our experiments we leave a vast majority of hyper-parameters used in the baseline model (Melis et al., 2017) unchanged, i.e. embedding and hidden state sizes, gradient clipping value, weight decay and the values used for all dropout layers (dropout on the word vectors, the output between LSTM layers, the output of the final LSTM, and embedding dropout)... We use a batch size of 64, truncated back-propagation with 35 time steps, a constant zero state is provided as the initial state with probability 0.01 (similar to Melis et al. (2017)), SGD with learning rate 30 (no momentum) which is multiplied by 0.1 whenever validation performance does not improve for 20 epochs, weight dropout on the hidden-to-hidden matrix 0.5, dropout every word in a mini-batch with probability 0.1, embedding dropout 0.65, output dropout 0.4 (final value of LSTM), gradient clipping of 0.25, weight decay 1.2 × 10⁻⁶, input embedding size of 655, the input/output size of the LSTM is the same as the embedding size (655) and the embedding weights are tied (Inan et al., 2016; Press & Wolf, 2016). (A sketch collecting these hyper-parameters appears after this table.) |
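The dataset-split row above fully specifies the CIFAR-10 semi-supervised splits (4k labeled / 41k unlabeled / 5k validation, plus the standard 10k test set). The snippet below is a minimal sketch of how such a split could be built with torchvision; the use of a random permutation, the index boundaries, and the fixed seed are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from torchvision.datasets import CIFAR10

# Standard CIFAR-10: 50,000 training images and 10,000 test images.
train_set = CIFAR10(root="./data", train=True, download=True)
test_set = CIFAR10(root="./data", train=False, download=True)   # used as-is as the 10k test set

rng = np.random.RandomState(0)             # assumed seed, not from the paper
perm = rng.permutation(len(train_set))     # shuffle the 50k training indices

labeled_idx = perm[:4_000]                 # 4k labeled training samples
unlabeled_idx = perm[4_000:45_000]         # 41k unlabeled training samples (labels discarded)
valid_idx = perm[45_000:]                  # 5k labeled validation samples
```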
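The experiment-setup row above lists the language-modeling hyper-parameters in prose. The dict below simply collects the quoted values in one place as a sketch; the key names and the dict itself are illustrative assumptions, not the authors' configuration object, and only the numeric values come from the quoted text (most are inherited from the Melis et al., 2017 baseline).

```python
# Hypothetical grouping of the PTB language-model hyper-parameters quoted above.
ptb_hparams = {
    "batch_size": 64,
    "bptt": 35,                  # truncated back-propagation length (time steps)
    "zero_state_prob": 0.01,     # probability of a constant zero initial state
    "optimizer": "SGD",
    "lr": 30.0,                  # no momentum
    "lr_decay": 0.1,             # multiply lr by 0.1 when validation stalls
    "lr_patience_epochs": 20,    # ...for 20 epochs
    "weight_dropout": 0.5,       # dropout on the hidden-to-hidden matrix
    "word_dropout": 0.1,         # drop every word in a mini-batch with this probability
    "embedding_dropout": 0.65,
    "output_dropout": 0.4,       # dropout on the final LSTM output
    "grad_clip": 0.25,
    "weight_decay": 1.2e-6,
    "embedding_size": 655,       # LSTM input/output size equals the embedding size
    "tie_weights": True,         # embedding and softmax weights are tied
}
```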