Revisiting Self-Training for Neural Sequence Generation
Authors: Junxian He, Jiatao Gu, Jiajun Shen, Marc'Aurelio Ranzato
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we first empirically show that self-training is able to decently improve the supervised baseline on neural sequence generation tasks. Empirical study on standard machine translation and text summarization benchmarks shows that noisy self-training is able to effectively utilize unlabeled data and improve the performance of the supervised baseline by a large margin. |
| Researcher Affiliation | Collaboration | Junxian He, Carnegie Mellon University, junxianh@cs.cmu.edu; Jiatao Gu, Jiajun Shen, Marc'Aurelio Ranzato, Facebook AI Research, New York, NY, {jgu,jiajunshen,ranzato}@fb.com |
| Pseudocode | Yes | Algorithm 1 Classic Self-training (a minimal sketch of this loop follows the table) |
| Open Source Code | Yes | Code is available at https://github.com/jxhe/self-training-text-generation. |
| Open Datasets | Yes | We work with the standard WMT 2014 English-German dataset... We evaluate noisy self-training on a low-resource machine translation dataset FloRes (Guzmán et al., 2019) from English (en) to Nepali (ne)... |
| Dataset Splits | Yes | We randomly sample 250 instances for training, 100 for validation, 5000 for test, and 4000 as the unlabeled data. All experiments are validated with loss on the validation set. |
| Hardware Specification | No | No specific hardware models (e.g., GPU/CPU models) are mentioned. The paper only mentions: 'Except for the toy sum dataset which we run on a single GPU and each batch contains 32 examples, all other experiments are run on 8 GPUs with an effective batch size of 33K tokens.' |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper states: 'All experiments throughout this paper including the transformer implementation are based on the fairseq toolkit (Ott et al., 2019)' and 'For all experiments, we optimize with Adam (Kingma & Ba, 2014)'. |
| Experiment Setup | Yes | We train with the Base Transformer architecture (Vaswani et al., 2017) and a dropout rate of 0.3. We use beam search decoding (beam size 5) to create the pseudo targets. For all experiments, we optimize with Adam (Kingma & Ba, 2014) using β1 = 0.9, β2 = 0.98, ϵ = 1e-8. We use the Adam optimizer with learning rate 0.0005, which is the default in fairseq. The pseudo-training takes 300K synchronous updates while the fine-tuning step takes 100K steps. The model architecture for the toy sum dataset is a single-layer LSTM with word embedding size 32, hidden state size 32, and dropout rate 0.3. (A configuration sketch follows the table.) |
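
The Pseudocode row refers to Algorithm 1 (classic self-training). Below is a minimal sketch of that loop, assuming hypothetical `train_fn` and `decode_fn` helpers; it is an illustration of the general procedure, not the authors' implementation.

```python
def classic_self_training(labeled, unlabeled, train_fn, decode_fn, rounds=3):
    """Sketch of classic self-training for sequence generation.

    labeled   -- list of (source, target) pairs (the parallel data)
    unlabeled -- list of source sequences (the monolingual data)
    train_fn  -- train_fn(pairs, init=None) -> model; trains or fine-tunes a model
    decode_fn -- decode_fn(model, source) -> target; e.g. beam search (beam size 5)
    """
    # 1. Train the base model on the parallel data.
    model = train_fn(labeled)
    for _ in range(rounds):
        # 2. Pseudo-label the monolingual sources with the current model.
        pseudo = [(x, decode_fn(model, x)) for x in unlabeled]
        # 3. Pseudo-training on the synthetic pairs, then fine-tuning on the real
        #    parallel data (the two-step decomposition described in the paper).
        model = train_fn(pseudo)
        model = train_fn(labeled, init=model)
    return model
```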
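
The Experiment Setup row reports a Base Transformer with dropout 0.3, trained with Adam (lr = 0.0005, β1 = 0.9, β2 = 0.98, ϵ = 1e-8). The following is a minimal PyTorch sketch of that configuration; the paper itself uses fairseq's Transformer implementation, and `nn.Transformer` here is only a stand-in for the Base Transformer.

```python
import torch
import torch.nn as nn

# Stand-in for the Base Transformer (Vaswani et al., 2017): d_model = 512, 8 heads,
# 6 encoder/decoder layers, feed-forward size 2048, with the reported dropout of 0.3.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.3,
)

# Adam with the reported hyperparameters: lr = 0.0005, betas = (0.9, 0.98), eps = 1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-8)
```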