Non-autoregressive Machine Translation with Probabilistic Context-free Grammar

Authors: Shangtong Gui, Chenze Shao, Zhengrui Ma, Xishan Zhang, Yunji Chen, Yang Feng

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on major WMT benchmarks for NAT (WMT14 En-De, WMT17 Zh-En, WMT16 En-Ro), which show that our method substantially improves translation performance and achieves performance comparable to the autoregressive Transformer [40] with only one-iteration parallel decoding. Moreover, PCFG-NAT allows for the generation of sentences in a more interpretable manner, thus bridging the gap between performance and explainability in neural machine translation.
Researcher Affiliation | Collaboration | (1) State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; (2) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; (3) University of Chinese Academy of Sciences; (4) Cambricon Technologies
Pseudocode | Yes | Listing 1: Code for Constructing a Support Tree in Section 3.1.2; Listing 2: Code for Constructing a Full Binary Tree with depth layer_size; Listing 3: Code for CYK Training of Local Prefix Tree in Section 3.3.1; Listing 4: Code for CYK Training of Main Chain in Section 3.3.1; Listing 5: Code for Glancing Training of the Best Parse Tree in Section 3.3.2; Listing 6: Code for the Viterbi Decoding Algorithm in Section 3.4. (A generic CYK inside-algorithm sketch appears after the table.)
Open Source Code | Yes | Code is publicly available at https://github.com/ictnlp/PCFG-NAT.
Open Datasets | Yes | Dataset: We conduct our experiments on WMT14 English-German (En-De, 4.5M sentence pairs), WMT17 Chinese-English (Zh-En, 20M sentence pairs), and WMT16 English-Romanian (En-Ro, 610k sentence pairs).
Dataset Splits | Yes | For WMT14 En-De and WMT17 Zh-En, we measure validation BLEU every epoch and average the 5 best checkpoints as the final model. For WMT16 En-Ro, we use only the best checkpoint on the validation set. (A checkpoint-averaging sketch appears after the table.)
Hardware Specification | Yes | We use the NVIDIA Tesla V100S-PCIE-32GB GPU to measure the translation latency on the WMT14 En-De test set with a batch size of 1. (A latency-measurement sketch appears after the table.)
Software Dependencies | No | The paper mentions 'fairseq [27]' and the 'nltk library [24]' but does not specify their version numbers, which are required for reproducibility.
Experiment Setup | Yes | For our method, we choose the l = 1, λ = 4 settings for the Support Tree of RH-PCFG and linearly anneal τ from 0.5 to 0.1 for glancing training. For DA-Transformer [14], we use λ = 8 for the graph size, which yields a comparable number of hidden states to our models. All models are optimized with Adam [20] with β = (0.9, 0.98) and ϵ = 10⁻⁸. For all models, each batch contains approximately 64K source words, and all models are trained for 300K steps. (An optimizer and τ-annealing sketch appears after the table.)
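
Listings 3 and 4 in the paper describe CYK-style training of the RH-PCFG. As a point of reference only, the following is a minimal, generic CYK inside-probability routine for a PCFG in Chomsky normal form; it is not the authors' code and does not implement the RH-PCFG specialization, but it shows the dynamic program that such listings build on. All names here are illustrative.

```python
# Generic CYK inside algorithm for a CNF PCFG (illustrative only; NOT the
# paper's RH-PCFG training code).
import math
from collections import defaultdict


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def inside_log_prob(tokens, start_symbol, unary_rules, binary_rules):
    """Return log P(start_symbol =>* tokens) under a CNF PCFG.

    unary_rules:  dict mapping (A, word)  -> log prob of rule A -> word
    binary_rules: dict mapping (A, B, C)  -> log prob of rule A -> B C
    """
    n = len(tokens)
    # chart[(i, j)][A] = log P(A =>* tokens[i:j])
    chart = defaultdict(dict)
    for i, w in enumerate(tokens):
        for (A, word), lp in unary_rules.items():
            if word == w:
                chart[(i, i + 1)][A] = lp
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            scores = defaultdict(list)
            for k in range(i + 1, j):          # split point
                for (A, B, C), lp in binary_rules.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        scores[A].append(lp + chart[(i, k)][B] + chart[(k, j)][C])
            for A, vals in scores.items():
                chart[(i, j)][A] = logsumexp(vals)
    return chart[(0, n)].get(start_symbol, float("-inf"))
```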
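
The checkpoint averaging quoted in the Dataset Splits row can be reproduced with a few lines of PyTorch. The sketch below assumes fairseq-style checkpoints that store the weights under a "model" key; the file names are hypothetical placeholders, and fairseq also ships its own averaging script that the authors may have used instead.

```python
# Minimal sketch: average the 5 best checkpoints into a single model.
# Checkpoint paths are placeholders, not the authors' actual file names.
import torch


def average_checkpoints(paths):
    """Average the 'model' state dicts of several fairseq-style checkpoints."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}


if __name__ == "__main__":
    best5 = [f"checkpoints/checkpoint.best_{i}.pt" for i in range(1, 6)]  # hypothetical names
    torch.save({"model": average_checkpoints(best5)}, "checkpoints/averaged.pt")
```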
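
For the latency measurement in the Hardware Specification row (batch size 1 on the WMT14 En-De test set), a typical GPU timing harness looks like the sketch below. `model.translate` is a stand-in for whatever decoding entry point the released code exposes; the explicit synchronization is what makes per-sentence GPU timings meaningful.

```python
# Hedged sketch of per-sentence translation latency measurement on a GPU.
# `model.translate` is a hypothetical decoding entry point, not an API of the
# released PCFG-NAT code.
import time
import torch


@torch.no_grad()
def measure_latency(model, test_sentences, warmup=10):
    model.eval()
    # Warm-up so one-time CUDA initialization does not skew the timings.
    for sent in test_sentences[:warmup]:
        model.translate([sent])
    torch.cuda.synchronize()
    start = time.perf_counter()
    for sent in test_sentences:
        model.translate([sent])       # batch size 1: one sentence at a time
        torch.cuda.synchronize()      # wait for the GPU before the next item
    elapsed = time.perf_counter() - start
    return elapsed / len(test_sentences)  # average seconds per sentence
```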
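
The optimizer settings and the linear τ-annealing schedule in the Experiment Setup row can be written down directly. The snippet below is a sketch under the stated hyperparameters (Adam with β = (0.9, 0.98), ϵ = 10⁻⁸, τ annealed from 0.5 to 0.1 over 300K steps); the placeholder module only stands in for the PCFG-NAT model, and the actual experiments were run with fairseq.

```python
# Sketch of the quoted optimizer settings and the linear glancing-temperature
# schedule; the Linear module is a placeholder for the PCFG-NAT model.
import torch

TOTAL_STEPS = 300_000
TAU_START, TAU_END = 0.5, 0.1


def glancing_tau(step: int) -> float:
    """Linearly anneal tau from 0.5 to 0.1 over the 300K training steps."""
    frac = min(step / TOTAL_STEPS, 1.0)
    return TAU_START + frac * (TAU_END - TAU_START)


model = torch.nn.Linear(512, 512)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-8)

print(glancing_tau(0), glancing_tau(150_000), glancing_tau(300_000))  # 0.5 0.3 0.1
```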