Sequence Level Training with Recurrent Neural Networks

Authors: Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba

ICLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our second contribution is a thorough empirical evaluation on three different tasks, namely, Text Summarization, Machine Translation and Image Captioning. We compare against several strong baselines... Our results show that MIXER with a simple greedy search achieves much better accuracy compared to the baselines on all the three tasks.
Researcher Affiliation | Industry | Facebook AI Research, {ranzato, spchopra, michaelauli, wojciech}@fb.com
Pseudocode | Yes | Algorithm 1: MIXER pseudo-code. (An illustrative sketch of the annealing schedule appears after this table.)
Open Source Code | Yes | Code available at: https://github.com/facebookresearch/MIXER
Open Datasets | Yes | The data set we use to train and evaluate our models consists of a subset of the Gigaword corpus (Graff et al., 2003)... We use data from the German English machine translation track of the IWSLT 2014 evaluation campaign (Cettolo et al., 2014)... For the image captioning task, we use the MSCOCO dataset (Lin et al., 2014).
Dataset Splits | Yes | Summarization: The number of sample pairs in the training, validation and test set are 179414, 22568, and 22259 respectively. Machine translation: The training data comprises of about 153000 sentences... Our validation set comprises of 6969 sentence pairs... The test set is a concatenation of dev2010, dev2012, tst2010, tst2011 and tst2012 which results in 6750 sentence pairs. Image captioning: We use the entire training set provided by the authors, which consists of around 80k images. We then took the original validation set (consisting of around 40k images) and randomly sampled (without replacement) 5000 images for validation and another 5000 for test. (An illustrative split sketch appears after this table.)
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU or CPU models) used to run its experiments.
Software Dependencies | No | The paper mentions using the 'tokenizer of the Moses toolkit' and a 'Convolutional Neural Network (CNN) trained on the Imagenet dataset', but it does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | For training, we use stochastic gradient descent with mini-batches of size 32 and we reset the hidden states at the beginning of each sequence. Before updating the parameters we re-scale the gradients if their norm is above 10... We search over the values of hyper-parameters, such as the initial learning rate, the various scheduling parameters, number of epochs, etc., using a held-out validation set. Table 2: Best scheduling parameters found by hyper-parameter search of MIXER. (An illustrative training-step sketch appears after this table.)
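Algorithm 1 in the paper (MIXER) anneals from cross-entropy (XENT) training to REINFORCE along the time steps of each target sequence. The Python sketch below illustrates only that annealing schedule, not the authors' Torch implementation; every constant (T, N_XENT, N_EPOCHS, DELTA, EPOCHS_PER_STAGE) is a placeholder chosen for the example rather than a value from the paper.

```python
# Minimal sketch of the MIXER annealing schedule (illustration only).
# All constants are placeholders, not values reported in the paper.
T = 25                 # assumed maximum target sequence length
N_XENT = 20            # epochs of pure cross-entropy pretraining (placeholder)
N_EPOCHS = 40          # total number of training epochs (placeholder)
DELTA = 3              # steps moved from XENT to REINFORCE at each stage (placeholder)
EPOCHS_PER_STAGE = 2   # epochs spent at each XENT/REINFORCE boundary (placeholder)

def xent_steps(epoch):
    """How many of the T time steps are trained with cross-entropy at this
    epoch; the remaining steps use REINFORCE on the model's own samples."""
    if epoch < N_XENT:
        return T                          # first stage: every step uses cross-entropy
    stage = (epoch - N_XENT) // EPOCHS_PER_STAGE + 1
    return max(T - stage * DELTA, 0)      # anneal toward training fully with REINFORCE

for epoch in range(N_EPOCHS):
    k = xent_steps(epoch)
    if k == T:
        print(f"epoch {epoch:2d}: all {T} steps use XENT")
    elif k == 0:
        print(f"epoch {epoch:2d}: all {T} steps use REINFORCE")
    else:
        print(f"epoch {epoch:2d}: steps 1..{k} use XENT, steps {k+1}..{T} use REINFORCE")
```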
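The image-captioning split is described procedurally: 5000 validation and 5000 test images sampled without replacement from the original MSCOCO validation set of around 40k images. The sketch below is a minimal illustration of that sampling, assuming the standard MSCOCO annotation file layout; the file path, the seed, and the use of Python's random module are assumptions, not details taken from the paper.

```python
# Hypothetical reconstruction of the MSCOCO validation/test split described above.
import json
import random

random.seed(0)  # fixed seed so the split is reproducible across runs (assumption)

# Standard MSCOCO annotation file; the path is an assumption for illustration.
with open("annotations/captions_val2014.json") as f:
    image_ids = [img["id"] for img in json.load(f)["images"]]  # ~40k validation images

# Sample 10000 distinct images (without replacement), then split them in half.
sampled = random.sample(image_ids, 10000)
val_ids, test_ids = sampled[:5000], sampled[5000:]
```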
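The experiment-setup quote (mini-batches of 32, hidden states reset at the start of each sequence, gradients re-scaled when their norm exceeds 10) corresponds to a short training step. The sketch below uses PyTorch as a stand-in for the authors' Torch code; the model, vocabulary size, and learning rate are placeholders rather than values reported in the paper.

```python
# Illustrative PyTorch training step matching the quoted optimization settings.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)  # placeholder recurrent model
readout = nn.Linear(256, 10000)                                     # placeholder vocabulary size
params = list(model.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)  # initial learning rate is a placeholder

def train_step(inputs, targets):
    """One SGD update on a mini-batch of shape (32, T, 256) with targets (32, T)."""
    optimizer.zero_grad()
    outputs, _ = model(inputs)        # no hidden state passed in: reset to zero each sequence
    logits = readout(outputs)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    loss.backward()
    # Re-scale the gradients if their norm is above 10, as described in the setup.
    torch.nn.utils.clip_grad_norm_(params, max_norm=10.0)
    optimizer.step()
    return loss.item()
```

Calling the LSTM without an explicit hidden state starts each sequence from a zero state, matching the "reset the hidden states at the beginning of each sequence" detail, and clip_grad_norm_ with max_norm=10.0 implements the gradient re-scaling.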