SEARNN: Training RNNs with global-local losses
Authors: Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, Simon Lacoste-Julien
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In order to validate these theoretical benefits, we ran SEARNN on two datasets and compared its performance against that of MLE. For a fair comparison, we use the same optimization routine for all methods. We pick the one that performs best for the MLE baseline. Note that in all the experiments of the paper, we use greedy decoding, both for our cost computation and for evaluation. Furthermore, whenever we use a mixed roll-out we always use 0.5 as our mixin parameter, following Chang et al. (2015). |
| Researcher Affiliation | Academia | Rémi Leblond (1,2), Jean-Baptiste Alayrac (1,2), Anton Osokin (1,2,3), Simon Lacoste-Julien (4,5); (1) Département d'informatique de l'ENS, Paris, France; (2) INRIA, École normale supérieure, CNRS, PSL Research University; (3) National Research University Higher School of Economics, Moscow, Russia; (4) Université de Montréal & Montreal Institute for Learning Algorithms (MILA); (5) Canadian Institute for Advanced Research (CIFAR); {firstname.lastname}@inria.fr |
| Pseudocode | Yes | Algorithm 1 SEARNN algorithm (for a simple encoder-decoder network); see the hedged structural sketch after this table. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | The first dataset is the optical character recognition (OCR) dataset introduced in Taskar et al. (2003). [...] The second dataset is the Spelling dataset introduced in Bahdanau et al. (2017). [...] We choose neural machine translation as our task, and the German-English translation track of the IWSLT 2014 campaign (Cettolo et al., 2014) as our dataset... |
| Dataset Splits | Yes | We reuse the pre-processing of Ranzato et al. (2016), obtaining training, validation and test datasets of roughly 153k, 7k and 7k sentence pairs respectively with vocabularies of size 22822 words for English and 32009 words for German. |
| Hardware Specification | No | The paper describes the models used (e.g., encoder-decoder model with GRU cells of size 128) but does not provide specific hardware details such as GPU or CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific optimizers like 'Adam optimizer (Kingma & Ba, 2015)' but does not provide version numbers for any software dependencies, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | For all runs, we use SGD with constant step-size 0.5 and batch size of 64. [...] For all runs, we use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001 and batch size of 128. [...] We use Adam as our optimizer, with an initial learning rate of 10^-3 gradually decreasing to 10^-5, and a batch size of 64. We select the best models on the validation set and report results both without and with dropout (0.3). (A hedged configuration sketch of these setups follows the table.) |
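The Pseudocode and Research Type rows refer to Algorithm 1 and to greedy decoding with a mixed roll-out (mixin 0.5). The toy sketch below only illustrates that cell-level structure (roll-in, per-candidate roll-outs, cost-sensitive local loss); the `roll_out` and `cost` stand-ins, the Hamming-distance cost, and the cross-entropy surrogate against the cheapest candidate are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

VOCAB, T = 5, 4                                     # toy vocabulary size and sequence length
reference = torch.tensor([1, 3, 0, 2])              # toy ground-truth output sequence

def roll_out(prefix, first_token, mixin=0.5):
    """Toy mixed roll-out: finish the sequence with the reference tokens with
    probability `mixin`, otherwise with a fixed filler token (a stand-in for the
    learned policy's greedy completion)."""
    follow_reference = bool(torch.rand(()) < mixin)
    tail = (reference[len(prefix) + 1:] if follow_reference
            else torch.zeros(T - len(prefix) - 1, dtype=torch.long))
    return torch.cat([torch.tensor(prefix + [first_token]), tail])

def cost(sequence):
    """Toy sequence-level cost: Hamming distance to the reference."""
    return (sequence != reference).float().sum()

logits = torch.randn(T, VOCAB, requires_grad=True)  # stand-in for the decoder's scores
loss, prefix = 0.0, []
for t in range(T):                                  # one SEARNN "cell" per output position
    # Roll out every candidate token from the current roll-in prefix and record its cost.
    costs = torch.stack([cost(roll_out(prefix, a)) for a in range(VOCAB)])
    # Cost-sensitive local loss; cross-entropy against the cheapest candidate is one
    # illustrative choice of surrogate.
    loss = loss + F.cross_entropy(logits[t:t + 1], costs.argmin().unsqueeze(0))
    prefix.append(int(logits[t].argmax()))          # learned (greedy) roll-in
loss.backward()                                     # gradients reach the scores of every cell
```

In a real run, the roll-in prefix would come from the encoder-decoder's own hidden states and the cost would be a task loss such as edit distance, as described in the quoted experimental setup.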
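The Experiment Setup row quotes three optimization configurations. A minimal PyTorch rendering of those hyperparameters is sketched below; the framework, the `model` placeholder (sized to the GRU cells of 128 mentioned in the Hardware row), and the exponential decay schedule are assumptions, and matching each setup to the OCR, Spelling, and IWSLT'14 experiments is inferred from the order of the quotes.

```python
import torch

# Placeholder model; only the hidden size 128 is taken from the report above.
model = torch.nn.GRU(input_size=128, hidden_size=128)

# First quoted setup: SGD with constant step-size 0.5 (batch size 64).
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.5)

# Second quoted setup: Adam with learning rate 0.001 (batch size 128).
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# Third quoted setup: Adam from 10^-3 gradually decreasing to 10^-5 (batch size 64,
# dropout 0.3 on the reported runs); the exponential decay below is illustrative,
# the text does not spell out the exact schedule.
opt_nmt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ExponentialLR(opt_nmt, gamma=0.99)
```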