Directed Acyclic Transformer for Non-Autoregressive Machine Translation
Authors: Fei Huang, Hao Zhou, Yang Liu, Hang Li, Minlie Huang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the raw training data of WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, which is the first NAT model that achieves competitive results with autoregressive Transformers without relying on knowledge distillation. |
| Researcher Affiliation | Collaboration | ¹The CoAI group, Tsinghua University, China. ²Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, China. ³ByteDance AI Lab. |
| Pseudocode | Yes | Algorithm 1 Greedy / Lookahead Decoding in PyTorch-like Parallel Pseudocode; Algorithm 2 Dynamic Programming Algorithm in PyTorch-like Parallel Pseudocode; Algorithm 3 Beam Search for DA-Transformer (a minimal sketch of the dynamic program is given after the table) |
| Open Source Code | Yes | We will release an efficient C++ implementation at https://github.com/thu-coai/DA-Transformer. |
| Open Datasets | Yes | Dataset: We conduct experiments on two benchmarks, WMT14 En-De (4.5M) and WMT17 Zh-En (20M), where we follow Zhou et al. (2020); Kasai et al. (2020) for pre-processing. |
| Dataset Splits | Yes | All models, including ATs, are trained for 300k updates with a batch of 64k tokens... We evaluate the BLEU scores on the validation set every epoch and average the best 5 checkpoints for the final model. |
| Hardware Specification | Yes | The training lasts approximately 32 hours on 16 Nvidia V100-32G GPUs. |
| Software Dependencies | No | The paper provides PyTorch-like pseudocode but does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | For regularization, we set dropout to 0.1, weight decay to 0.01, and label smoothing to 0.1. All models, including ATs, are trained for 300k updates with a batch of 64k tokens. The learning rate warms up to 5×10⁻⁴ within 10k steps and then decays with the inverse square-root schedule... For DA-Transformer, we use λ = 8... We linearly anneal τ from 0.5 to 0.1 for glancing training. For beam search, we set the beam size to 200, γ to 0.1, and tune α from [1, 1.4] on the validation set. (These hyperparameters are collected in the configuration sketch after the table.) |
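
The Pseudocode row refers to the paper's Algorithm 2, a dynamic program that sums the target likelihood over all decoding paths of the directed acyclic graph. Below is a minimal, unvectorized sketch of that recurrence for a single sentence; it is not the authors' released implementation, and the 0-indexed start vertex, the end-at-last-vertex constraint, and all tensor names are illustrative assumptions.

```python
import torch


def dag_target_logprob(emit_logp, trans_logp, target):
    """Sketch of the path-marginalization dynamic program (cf. Algorithm 2).

    emit_logp:  (L, V) log P(token | vertex) for the L decoder vertices.
    trans_logp: (L, L) log transition probabilities, assumed upper-triangular
                (trans_logp[u, v] = -inf unless u < v).
    target:     (M,) target token ids, with M <= L.
    Returns the log-probability of `target` marginalized over all DAG paths.
    """
    L, M = emit_logp.size(0), target.size(0)
    # f[i, u]: log prob of generating target[:i+1] along a path that ends at vertex u.
    f = torch.full((M, L), float("-inf"))
    f[0, 0] = emit_logp[0, target[0]]              # assume every path starts at vertex 0
    for i in range(1, M):
        # take one transition v -> u, then emit target[i] at the new vertex u
        step = f[i - 1].unsqueeze(1) + trans_logp  # (L, L), indexed [v, u]
        f[i] = torch.logsumexp(step, dim=0) + emit_logp[:, target[i]]
    return f[M - 1, L - 1]                         # assume every path ends at the last vertex
```

Training maximizes this marginal likelihood (combined with glancing training in the paper); the repository quoted above is said to provide an efficient C++ implementation that batches and vectorizes this loop.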
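
For quick reference, the hyperparameters quoted in the Experiment Setup row are gathered below as a plain Python dictionary. The key names are ours for readability, not options of the authors' training scripts.

```python
# Hyperparameters reported in the paper; key names are illustrative, not official flags.
DA_TRANSFORMER_SETUP = {
    "dropout": 0.1,
    "weight_decay": 0.01,
    "label_smoothing": 0.1,
    "max_updates": 300_000,
    "batch_size_tokens": 64_000,      # per update, on 16 Nvidia V100-32G GPUs
    "peak_learning_rate": 5e-4,
    "warmup_steps": 10_000,
    "lr_schedule": "inverse_sqrt",    # inverse square-root decay after warmup
    "lambda": 8,                      # the paper's λ hyperparameter
    "glancing_tau": (0.5, 0.1),       # τ linearly annealed during training
    "beam_size": 200,                 # beam-search decoding
    "gamma": 0.1,                     # beam-search γ from the quoted setup
    "alpha_range": (1.0, 1.4),        # α tuned on the validation set
}
```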