Reward Augmented Maximum Likelihood for Neural Structured Prediction

Authors: Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on neural sequence to sequence models for speech recognition and machine translation show notable improvements over a maximum likelihood baseline by using reward augmented maximum likelihood (RML), where the rewards are defined as the negative edit distance between the outputs and the ground truth labels. We compare our approach, reward augmented maximum likelihood (RML), with standard maximum likelihood (ML) training on sequence prediction tasks using state-of-the-art attention-based recurrent neural networks [29, 2]. Our experiments demonstrate that the RML approach considerably outperforms the ML baseline on both speech recognition and machine translation tasks. (An illustrative sketch of the RML sampling idea appears after this table.)
Researcher Affiliation | Industry | Google Brain
Pseudocode | No | The paper describes its methods in text but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about the release of its source code, nor does it include links to a code repository.
Open Datasets | Yes | For experiments on speech recognition, we use the TIMIT dataset, a standard benchmark for clean phone recognition. This dataset consists of recordings from different speakers reading ten phonetically rich sentences covering major dialects of American English. We use the standard train / dev / test splits suggested by the Kaldi toolkit [24].
Dataset Splits | Yes | We use the standard train / dev / test splits suggested by the Kaldi toolkit [24].
Hardware Specification | No | The paper mentions training models using "asynchronous SGD with 12 replicas" but does not provide specific details about the hardware used (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper mentions tools like the "Kaldi toolkit" but does not specify version numbers for any software dependencies required to reproduce the experiments.
Experiment Setup | Yes | We use an attention-based encoder-decoder recurrent model of [5] with three 256-dimensional LSTM layers for encoding and one 256-dimensional LSTM layer for decoding. We train the models using asynchronous SGD with 12 replicas without momentum. We use mini-batches of size 128. We initially use a learning rate of 0.5, which we then exponentially decay to 0.05 after 800K steps. We perform beam search decoding with a beam size of 8. (The reported hyperparameters are collected in a sketch after this table.)
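As a reading aid for the Research Type row, here is a minimal sketch of the RML idea quoted above: augmented targets are sampled from an exponentiated payoff distribution q(y | y*) proportional to exp(r(y, y*) / tau), with the reward r(y, y*) = -edit_distance(y, y*), and the model is then trained with the ordinary maximum-likelihood (cross-entropy) loss on the sampled targets in place of the ground truth. The candidate set, the `edit_distance` helper, and the temperature value below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of RML target sampling, assuming the reward is the negative
# edit distance over a small explicit candidate set; not the authors' code.
import math
import random

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a[i-1]
                        dp[j - 1] + 1,                  # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[n]

def sample_rml_target(y_star, candidates, tau=0.9):
    """Sample one augmented target, weighted by exp(-edit_distance(y, y*) / tau)."""
    weights = [math.exp(-edit_distance(y, y_star) / tau) for y in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# The sampled target then replaces the ground truth in the usual
# maximum-likelihood (cross-entropy) loss for that training example.
y_star = ["h#", "sh", "iy", "hv", "ae", "dcl"]  # hypothetical phone sequence
candidates = [y_star, y_star[:-1], y_star + ["d"], ["h#", "sh", "iy"]]
print(sample_rml_target(y_star, candidates))
```

In the paper, augmented outputs are drawn by randomly editing the ground-truth sequence rather than by enumerating candidates; the small explicit list here is only for illustration.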
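Similarly, for the Experiment Setup row, the sketch below collects the reported hyperparameters in one place and shows one plausible reading of the learning-rate schedule (exponential decay from 0.5 to 0.05 over the first 800K steps, then held constant). The config keys and the exact decay formula are assumptions for illustration; the paper only states the values quoted above.

```python
# Reported hyperparameters from the quoted setup, gathered into a config dict;
# key names and the decay formula are illustrative assumptions.
CONFIG = {
    "encoder_lstm_layers": 3,
    "encoder_lstm_units": 256,
    "decoder_lstm_layers": 1,
    "decoder_lstm_units": 256,
    "optimizer": "asynchronous SGD, 12 replicas, no momentum",
    "batch_size": 128,
    "lr_initial": 0.5,
    "lr_final": 0.05,
    "lr_decay_steps": 800_000,
    "beam_size": 8,
}

def learning_rate(step, cfg=CONFIG):
    """Exponential decay from lr_initial to lr_final over lr_decay_steps,
    then held at lr_final (assumed interpretation of the schedule)."""
    frac = min(step, cfg["lr_decay_steps"]) / cfg["lr_decay_steps"]
    return cfg["lr_initial"] * (cfg["lr_final"] / cfg["lr_initial"]) ** frac

print(learning_rate(0), learning_rate(400_000), learning_rate(1_000_000))
# -> 0.5, ~0.158, 0.05
```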