Revisiting Self-Training for Neural Sequence Generation

Authors: Junxian He, Jiatao Gu, Jiajun Shen, Marc'Aurelio Ranzato

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | we first empirically show that self-training is able to decently improve the supervised baseline on neural sequence generation tasks. Empirical study on standard machine translation and text summarization benchmarks shows that noisy self-training is able to effectively utilize unlabeled data and improve the performance of the supervised baseline by a large margin.
Researcher Affiliation | Collaboration | Junxian He (Carnegie Mellon University) junxianh@cs.cmu.edu; Jiatao Gu, Jiajun Shen, Marc'Aurelio Ranzato (Facebook AI Research, New York, NY) {jgu,jiajunshen,ranzato}@fb.com
Pseudocode | Yes | Algorithm 1 Classic Self-training (a sketch of this loop follows the table)
Open Source Code | Yes | Code is available at https://github.com/jxhe/self-training-text-generation.
Open Datasets | Yes | We work with the standard WMT 2014 English-German dataset... We evaluate noisy self-training on a low-resource machine translation dataset FloRes (Guzmán et al., 2019) from English (en) to Nepali (ne)...
Dataset Splits | Yes | We randomly sample 250 instances for training, 100 for validation, 5000 for test, and 4000 as the unlabeled data. All experiments are validated with loss on the validation set. (A split sketch follows the table.)
Hardware Specification | No | No specific hardware models (e.g., GPU/CPU models) are mentioned. The paper only states: 'Except for the toy sum dataset which we run on a single GPU and each batch contains 32 examples, all other experiments are run on 8 GPUs with an effective batch size of 33K tokens.'
Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper states: 'All experiments throughout this paper including the transformer implementation are based on the fairseq toolkit (Ott et al., 2019)' and 'For all experiments, we optimize with Adam (Kingma & Ba, 2014)'.
Experiment Setup | Yes | We train with the Base Transformer architecture (Vaswani et al., 2017) and dropout rate 0.3. We use beam search decoding (beam size 5) to create the pseudo targets. For all experiments, we optimize with Adam (Kingma & Ba, 2014) using β1 = 0.9, β2 = 0.98, ϵ = 1e-8. We use the Adam optimizer with learning rate 0.0005, the default in fairseq. The pseudo-training step takes 300K synchronous updates while the fine-tuning step takes 100K steps. The model architecture for the toy sum dataset is a single-layer LSTM with word embedding size 32, hidden state size 32, and dropout rate 0.3. (A configuration sketch follows the table.)
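
The Pseudocode row above refers to the paper's Algorithm 1 (classic self-training), which alternates between decoding pseudo targets for unlabeled sources and retraining. Below is a minimal Python sketch of that loop, not the authors' fairseq implementation; `train_model`, `beam_search_decode`, and `num_rounds` are hypothetical, caller-supplied placeholders.

```python
# Minimal sketch of classic self-training as summarized above (Algorithm 1).
# `train_model` and `beam_search_decode` are caller-supplied, hypothetical
# callables standing in for the actual fairseq training/decoding routines.
def classic_self_training(labeled_pairs, unlabeled_sources,
                          train_model, beam_search_decode, num_rounds=3):
    """labeled_pairs: list of (source, target); unlabeled_sources: list of source.

    Expected signatures (assumed): train_model(pairs, init=None) -> model;
    beam_search_decode(model, source, beam_size) -> target.
    num_rounds is a hypothetical stopping criterion.
    """
    # 1. Train a base model on the labeled parallel data.
    model = train_model(labeled_pairs, init=None)

    for _ in range(num_rounds):
        # 2. Decode pseudo targets for the unlabeled sources
        #    (the excerpt reports beam search with beam size 5).
        pseudo_pairs = [(x, beam_search_decode(model, x, beam_size=5))
                        for x in unlabeled_sources]

        # 3. Pseudo-train on the synthetic pairs, then fine-tune on real data.
        model = train_model(pseudo_pairs, init=model)
        model = train_model(labeled_pairs, init=model)

    return model
```

The noisy self-training variant studied in the paper perturbs the inputs of the pseudo-parallel data before the pseudo-training step; the loop structure is otherwise the same.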
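
The split sizes quoted in the Dataset Splits row (250 train, 100 validation, 5,000 test, 4,000 unlabeled) add up to a 9,350-example pool. A minimal sketch of such a random split, with a hypothetical helper name and seed, might look like:

```python
import random

# Hypothetical reconstruction of the split reported above: 250 train, 100
# validation, 5,000 test, and 4,000 unlabeled examples drawn at random from a
# pool of 9,350. The helper name and the seed are assumptions for illustration.
def split_dataset(examples, seed=0):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    train     = examples[:250]
    valid     = examples[250:350]
    test      = examples[350:5350]
    unlabeled = examples[5350:9350]
    return train, valid, test, unlabeled
```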
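
As a rough illustration of the toy-sum setup quoted in the Experiment Setup row, here is a hedged PyTorch sketch (not the authors' fairseq code) of a single-layer LSTM with embedding and hidden size 32 and dropout 0.3, paired with Adam at the reported hyperparameters; the vocabulary size and the token-level output head are assumptions.

```python
import torch
import torch.nn as nn

# Hedged PyTorch sketch of the quoted toy-sum setup, not the authors' fairseq
# code: single-layer LSTM, embedding size 32, hidden size 32, dropout 0.3,
# optimized with Adam at lr 0.0005, betas (0.9, 0.98), eps 1e-8.
VOCAB_SIZE = 16  # hypothetical; not specified in the excerpt

class ToySumModel(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, emb_dim=32, hidden_dim=32, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Dropout is applied to the embeddings, since a single-layer nn.LSTM
        # ignores its own dropout argument.
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):            # tokens: LongTensor of shape (batch, seq)
        x = self.dropout(self.embed(tokens))
        out, _ = self.lstm(x)
        return self.proj(out)             # logits of shape (batch, seq, vocab)

model = ToySumModel()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-8)
```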