Directed Acyclic Transformer for Non-Autoregressive Machine Translation
Authors: Fei Huang, Hao Zhou, Yang Liu, Hang Li, Minlie Huang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the raw training data of WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, which is the first NAT model that achieves competitive results with autoregressive Transformers without relying on knowledge distillation. |
| Researcher Affiliation | Collaboration | ¹The CoAI group, Tsinghua University, China. ²Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, China. ³ByteDance AI Lab. |
| Pseudocode | Yes | Algorithm 1 Greedy / Lookahead Decoding in PyTorch-like Parallel Pseudocode; Algorithm 2 Dynamic Programming Algorithm in PyTorch-like Parallel Pseudocode; Algorithm 3 Beam Search for DA-Transformer (a minimal sketch of the dynamic program is given after the table) |
| Open Source Code | Yes | We will release an efficient C++ implementation at https://github.com/thu-coai/DA-Transformer. |
| Open Datasets | Yes | Dataset: We conduct experiments on two benchmarks, WMT14 En-De (4.5M) and WMT17 Zh-En (20M), where we follow Zhou et al. (2020); Kasai et al. (2020) for pre-processing. |
| Dataset Splits | Yes | All models, including ATs, are trained for 300k updates with a batch of 64k tokens... We evaluate the BLEU scores on the validation set every epoch and average the best 5 checkpoints for the final model. |
| Hardware Specification | Yes | The training lasts approximately 32 hours on 16 Nvidia V100-32G GPUs. |
| Software Dependencies | No | The paper provides PyTorch-like pseudocode but does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | For regularization, we set dropout to 0.1, weight decay to 0.01, and label smoothing to 0.1. All models, including ATs, are trained for 300k updates with a batch of 64k tokens. The learning rate warms up to 5×10⁻⁴ within 10k steps and then decays with the inverse square-root schedule... For DA-Transformer, we use λ = 8... We linearly anneal τ from 0.5 to 0.1 for glancing training. For beam search, we set the beam size to 200, γ to 0.1, and tune α from [1, 1.4] on the validation set. (These hyperparameters are collected in the configuration sketch after the table.) |
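
The Pseudocode row refers to the paper's Algorithm 2, a dynamic program that sums the target likelihood over all decoding paths of the directed acyclic graph. Below is a minimal, unvectorized sketch of that recurrence for a single sentence; it is not the authors' released implementation, and the 0-indexed start vertex, the end-at-last-vertex constraint, and all tensor names are illustrative assumptions.

```python
import torch


def dag_target_logprob(emit_logp, trans_logp, target):
    """Sketch of the path-marginalization dynamic program (cf. Algorithm 2).

    emit_logp:  (L, V) log P(token | vertex) for the L decoder vertices.
    trans_logp: (L, L) log transition probabilities, assumed upper-triangular
                (trans_logp[u, v] = -inf unless u < v).
    target:     (M,) target token ids, with M <= L.
    Returns the log-probability of `target` marginalized over all DAG paths.
    """
    L, M = emit_logp.size(0), target.size(0)
    # f[i, u]: log prob of generating target[:i+1] along a path that ends at vertex u.
    f = torch.full((M, L), float("-inf"))
    f[0, 0] = emit_logp[0, target[0]]              # assume every path starts at vertex 0
    for i in range(1, M):
        # take one transition v -> u, then emit target[i] at the new vertex u
        step = f[i - 1].unsqueeze(1) + trans_logp  # (L, L), indexed [v, u]
        f[i] = torch.logsumexp(step, dim=0) + emit_logp[:, target[i]]
    return f[M - 1, L - 1]                         # assume every path ends at the last vertex
```

Training maximizes this marginal likelihood (combined with glancing training in the paper); the repository quoted above is said to provide an efficient C++ implementation that batches and vectorizes this loop.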
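
For quick reference, the hyperparameters quoted in the Experiment Setup row are gathered below as a plain Python dictionary. The key names are ours for readability, not options of the authors' training scripts.

```python
# Hyperparameters reported in the paper; key names are illustrative, not official flags.
DA_TRANSFORMER_SETUP = {
    "dropout": 0.1,
    "weight_decay": 0.01,
    "label_smoothing": 0.1,
    "max_updates": 300_000,
    "batch_size_tokens": 64_000,      # per update, on 16 Nvidia V100-32G GPUs
    "peak_learning_rate": 5e-4,
    "warmup_steps": 10_000,
    "lr_schedule": "inverse_sqrt",    # inverse square-root decay after warmup
    "lambda": 8,                      # the paper's λ hyperparameter
    "glancing_tau": (0.5, 0.1),       # τ linearly annealed during training
    "beam_size": 200,                 # beam-search decoding
    "gamma": 0.1,                     # beam-search γ from the quoted setup
    "alpha_range": (1.0, 1.4),        # α tuned on the validation set
}
```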