On the Weaknesses of Reinforcement Learning for Neural Machine Translation

Authors: Leshem Choshen, Lior Fox, Zohar Aizenbud, Omri Abend

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Second, using both naturalistic experiments and carefully constructed simulations, we show that performance gains observed in the literature likely stem not from making target tokens the most probable, but from unrelated effects, such as increasing the peakiness of the output distribution (i.e., the probability mass of the most probable tokens). We do so by comparing a setting where the reward is informative, vs. one where it is constant. In Section 4 we discuss this peakiness effect (PKE). (A peakiness-measurement sketch follows the table.)
Researcher Affiliation | Academia | Leshem Choshen (1), Lior Fox (2), Zohar Aizenbud (1), Omri Abend (1,3); (1) School of Computer Science and Engineering, (2) The Edmond and Lily Safra Center for Brain Sciences, (3) Department of Cognitive Sciences, The Hebrew University of Jerusalem; first.last@mail.huji.ac.il, oabend@cs.huji.ac.il
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | The model was pretrained on WMT2015 training data (Bojar et al., 2015). Hyperparameters are reported in Appendix A.3. We define one of the tokens in V to be the target token and denote it with y_best.
Dataset Splits | Yes | We use early stopping with a patience of 10 epochs, where each epoch consists of 5,000 sentences sampled from the WMT2015 (Bojar et al., 2015) German-English training data. We use k = 1. We retuned the learning rate and positive-baseline settings against the development set. Other hyper-parameters were an exact replication of the experiments reported in (Yang et al., 2018). (An early-stopping sketch follows the table.)
Hardware Specification | No | Pretraining took about 7 days with 4 GPUs; afterwards, training took roughly the same time. This mentions the number of GPUs but not the GPU models or any other hardware details.
Software Dependencies | No | The paper does not name ancillary software with version numbers (e.g., Python 3.8 or CPLEX 12.4) needed to replicate the experiment.
Experiment Setup | Yes | The size of V (distinct BPE tokens) is 30,715. For the MT experiments we used 6 layers in the encoder and the decoder. The size of the embeddings was 512. Gradient clipping was used with a size of 5 for pre-training (see Discussion on why not to use it in training). We did not use attention dropout, but a 0.1 residual dropout rate was used. In pretraining and training, sentences of more than 50 tokens were discarded. Pretraining and training were considered finished when BLEU did not increase on the development set for 10 consecutive evaluations; evaluation was done every 1,000 and 5,000 batches (of size 100 and 256) for pretraining and training, respectively. The learning rate was 0.01 for rmsprop (Tieleman & Hinton, 2012) in pretraining and 0.005 for adam (Kingma & Ba, 2015) with decay in training. 4,000 learning-rate warm-up steps were used. Pretraining took about 7 days with 4 GPUs; afterwards, training took roughly the same time. Monte Carlo used 20 sentence rolls per word. We experiment with α = 0.005 and k = 20, common settings in the literature, and average over 100 trials. (A configuration sketch collecting these values follows the table.)
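
The peakiness effect quoted in the Research Type row can be made concrete with a small, self-contained simulation. The sketch below is not the paper's code: the single-softmax toy "policy", vocabulary size, target index, seed, and step count are illustrative assumptions; only the learning rate of 0.005 and the informative-vs-constant reward contrast are taken from the quotes above. It shows how peakiness (top-k probability mass) can be measured and how sampled REINFORCE updates can be run under each reward, printing both the top-1 mass and the target token's probability so the two quantities the paper distinguishes stay separate.

import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def peakiness(p, k=1):
    """Probability mass held by the k most probable tokens."""
    return float(np.sort(p)[-k:].sum())

def reinforce_step(theta, reward_fn, rng, lr=0.005):
    """One sampled REINFORCE update for a single softmax 'policy' over a toy vocabulary."""
    p = softmax(theta)
    y = rng.choice(len(theta), p=p)
    r = reward_fn(int(y))
    grad_log_p = -p.copy()
    grad_log_p[y] += 1.0          # d/d_theta log p(y) = one_hot(y) - p
    return theta + lr * r * grad_log_p

rng = np.random.default_rng(0)
vocab_size, target, steps = 100, 7, 10_000   # toy sizes; the paper's vocabulary has 30,715 BPE tokens
theta_informative = rng.normal(size=vocab_size)
theta_constant = theta_informative.copy()

for _ in range(steps):
    # Informative reward: 1 only when the target token is sampled (one reading of the quoted setting).
    theta_informative = reinforce_step(theta_informative, lambda y: float(y == target), rng)
    # Constant reward: every sampled token receives the same reward of 1.
    theta_constant = reinforce_step(theta_constant, lambda y: 1.0, rng)

for name, theta in [("informative", theta_informative), ("constant", theta_constant)]:
    p = softmax(theta)
    print(f"{name:11s} reward: top-1 mass = {peakiness(p):.3f}, P(target) = {p[target]:.4f}")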
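
The Dataset Splits row quotes early stopping with a patience of 10 against the development set. The paper does not release code, so the stopping rule below is a generic sketch: the training and BLEU-evaluation steps are passed in as hypothetical callables, and only the patience value and the per-epoch sampling note come from the quote.

from typing import Callable

def train_with_early_stopping(
    train_one_epoch: Callable[[], None],
    evaluate_bleu: Callable[[], float],
    max_epochs: int = 1000,
    patience: int = 10,
) -> float:
    """Run training until dev BLEU fails to improve for `patience` consecutive evaluations."""
    best_bleu, stale = float("-inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()        # e.g. one pass over 5,000 sentences sampled from WMT2015 de-en
        bleu = evaluate_bleu()   # score on the held-out development set
        if bleu > best_bleu:
            best_bleu, stale = bleu, 0
        else:
            stale += 1
            if stale >= patience:    # 10 consecutive evaluations without improvement
                break
    return best_bleu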
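
The Experiment Setup row lists the reported hyper-parameters in running prose; the mapping below simply collects those values in one place for easier scanning. The key names are my own and do not come from the authors' (unreleased) code.

# Hyper-parameters as reported in the quoted setup; key names are illustrative.
EXPERIMENT_SETUP = {
    "vocab_size_bpe": 30_715,
    "encoder_layers": 6,
    "decoder_layers": 6,
    "embedding_dim": 512,
    "grad_clip_pretraining": 5.0,                 # no clipping during RL training
    "attention_dropout": 0.0,
    "residual_dropout": 0.1,
    "max_sentence_length": 50,                    # longer sentences discarded
    "early_stopping_patience": 10,                # evaluations without dev-BLEU gain
    "eval_every_batches": {"pretraining": 1_000, "training": 5_000},
    "batch_size": {"pretraining": 100, "training": 256},
    "optimizer_and_lr": {"pretraining": ("rmsprop", 0.01), "training": ("adam+decay", 0.005)},
    "lr_warmup_steps": 4_000,
    "monte_carlo_rolls_per_word": 20,
    "simulation": {"alpha": 0.005, "k": 20, "trials": 100},
}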