On the Learning of Non-Autoregressive Transformers

Authors: Fei Huang, Tianhua Tao, Hao Zhou, Lei Li, Minlie Huang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies show that our perspective can explain the phenomena in NAT learning and guide the design of new training methods.
Researcher Affiliation | Collaboration | Fei Huang*1, Tianhua Tao*1, Hao Zhou2, Lei Li3, Minlie Huang1 ... 1The CoAI group, Tsinghua University. ... 2Institute for AI Industry Research, Tsinghua University. 3University of California, Santa Barbara. *Equal contribution. This work was done while Fei Huang was a research intern and Hao Zhou was a research scientist at ByteDance AI Lab.
Pseudocode | No | No section or block explicitly labeled 'Pseudocode' or 'Algorithm' was found, nor were any code-like formatted procedures presented in the paper.
Open Source Code | No | No explicit statement about releasing open-source code for the methodology or a direct link to a code repository was found in the paper.
Open Datasets | Yes | We use two translation benchmarks, WMT14 En-De (4.5M) and WMT17 Zh-En (20M), and follow Zhou et al. (2020); Kasai et al. (2020) for preprocessing.
Dataset Splits | No | The paper mentions using a 'validation set' for evaluation and model selection ('The likelihood is obtained on the validation set.', 'We evaluate the BLEU scores on the validation set every epoch'), but it does not explicitly provide the specific percentages or sample counts for the training, validation, and test dataset splits.
Hardware Specification | Yes | All models are trained with mixed precision floating point arithmetic on 8 Nvidia V100-32G GPUs.
Software Dependencies | No | The paper states, 'All our models are implemented with Fairseq (Ott et al., 2019)', but does not provide specific version numbers for Fairseq or any other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For regularization, we set dropout to 0.1, weight decay to 0.01, and label smoothing to 0.1. Except for OaXE, all models are trained for 300k updates with a batch of approximately 64k tokens. The learning rate warms up to 5 × 10⁻⁴ within 10k steps and then decays with the inverse square-root schedule.
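The last row fixes a concrete learning-rate curve. Below is a minimal Python sketch, not code from the paper, of an inverse square-root schedule with linear warmup using the reported peak rate of 5 × 10⁻⁴ and 10k warmup updates; the function name and the assumption of a zero starting rate are ours (Fairseq's inverse_sqrt scheduler also accepts a non-zero warmup-init-lr).

```python
import math

def inverse_sqrt_lr(step: int,
                    peak_lr: float = 5e-4,       # reported peak learning rate
                    warmup_steps: int = 10_000   # reported warmup updates
                    ) -> float:
    """Inverse square-root schedule with linear warmup (illustrative sketch).

    Warms up linearly from 0 to `peak_lr` over `warmup_steps`, then decays
    proportionally to 1/sqrt(step) so the two pieces meet at `peak_lr`
    exactly when step == warmup_steps.
    """
    step = max(step, 1)
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

if __name__ == "__main__":
    # Spot-check a few points on the reported 300k-update run.
    for s in (1_000, 10_000, 40_000, 300_000):
        print(f"step {s:>7}: lr = {inverse_sqrt_lr(s):.2e}")
```

Under these assumptions the rate is back down to roughly 9 × 10⁻⁵ by the 300k-th update, which gives a quick sanity check when re-implementing the reported setup.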