Cascaded Text Generation with Markov Transformers

Authors: Yuntian Deng, Alexander Rush

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on five machine translation datasets compare this approach to other beam search and non-autoregressive baselines. Our inference approach is comparably fast to non-autoregressive methods while allowing for local dependencies in a principled, probabilistic way. Results validate the competitive accuracy/speed tradeoff of our approach compared to existing methods.
Researcher Affiliation | Academia | Yuntian Deng (Harvard University, dengyuntian@seas.harvard.edu); Alexander M. Rush (Cornell University, arush@cornell.edu)
Pseudocode | Yes | Algorithm 1: Parallel Cascaded Decoding (a simplified coarse-to-fine sketch follows the table)
Open Source Code | Yes | The code for reproducing all results is available at https://github.com/harvardnlp/cascaded-generation.
Open Datasets | Yes | We evaluate our approach on five commonly used machine translation benchmark datasets: IWSLT14 De-En [6] (160k parallel sentences), WMT14 En-De/De-En [29] (4M parallel sentences), and WMT16 En-Ro/Ro-En [3] (610k parallel sentences).
Dataset Splits | Yes | We sample all validation datasets to be at most 3k. To process the data, we use Byte Pair Encoding (BPE) [46, 23] learned on the training set with a shared vocabulary between source and target. (A preprocessing sketch follows the table.)
Hardware Specification | Yes | We measure the average decoding time of a single sentence [13, 25, 16, 15, 55, 51] on a 12GB Nvidia Titan X GPU. (A latency-measurement sketch follows the table.)
Software Dependencies | No | The paper mentions using FAIRSEQ [34] and PyTorch [35] but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Model settings: the Markov transformer uses the same hyperparameters as standard transformers. The base settings are from FAIRSEQ [34]: for IWSLT14 De-En, we use 6 layers, 4 attention heads, model dimension 512, hidden dimension 1024; for WMT14 En-De/De-En and WMT16 En-Ro/Ro-En we use 6 layers, 8 attention heads, model dimension 512, hidden dimension 2048. It differs only in the application of attention barriers, where we set M = 4. The optimization settings can be found in the supplementary materials. (The quoted configurations are collected in a sketch after the table.)
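
The Pseudocode row names Algorithm 1 (Parallel Cascaded Decoding) but does not reproduce it. The snippet below is only a simplified, two-stage coarse-to-fine illustration of the underlying idea: a low-order model prunes the candidates at each position, and a higher-order model is then decoded over the surviving lattice. It is not the paper's parallel algorithm; the tensor shapes and the top-k pruning rule are assumptions made for this sketch.

```python
# Simplified coarse-to-fine decoding sketch (NOT the paper's Algorithm 1).
import torch

def cascaded_decode_sketch(unigram_logp, bigram_logp, topk=4):
    """unigram_logp: (T, V) per-position token log-probabilities (order 0).
    bigram_logp: (T-1, V, V) transition scores between adjacent positions.
    Returns a list of T decoded token ids."""
    T, V = unigram_logp.shape
    # Stage 1 (order 0): keep the top-k candidates at every position in parallel.
    cand_scores, cand_ids = unigram_logp.topk(topk, dim=-1)        # (T, k)
    # Stage 2 (order 1): Viterbi over the pruned lattice with bigram scores.
    back = []
    score = cand_scores[0]                                         # (k,)
    for t in range(1, T):
        # Transition scores restricted to surviving candidates: (k, k).
        trans = bigram_logp[t - 1][cand_ids[t - 1]][:, cand_ids[t]]
        total = score.unsqueeze(1) + trans + cand_scores[t].unsqueeze(0)
        score, prev = total.max(dim=0)                             # (k,), (k,)
        back.append(prev)
    # Backtrace the best path through the pruned candidate sets.
    best = int(score.argmax())
    path = [best]
    for prev in reversed(back):
        best = int(prev[best])
        path.append(best)
    path.reverse()
    return [int(cand_ids[t, j]) for t, j in enumerate(path)]

# Toy usage with random scores.
torch.manual_seed(0)
T, V = 5, 20
print(cascaded_decode_sketch(torch.randn(T, V), torch.randn(T - 1, V, V)))
```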
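
The Dataset Splits row notes that BPE is learned on the training set with a vocabulary shared between source and target. Below is a minimal sketch of that step, assuming the subword-nmt package; the file names and merge count are illustrative, since the excerpt does not specify the exact tooling or number of merges.

```python
# Hedged BPE preprocessing sketch: joint codes learned on concatenated
# source+target training text (shared vocabulary), then applied to all splits.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Learn BPE merges on the joined training text (hypothetical file names).
with open("train.joined.tok", encoding="utf-8") as train, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(train, codes_out, num_symbols=10000)

# 2. Apply the shared codes to both languages of every split.
with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)

for split in ("train", "valid", "test"):
    for lang in ("de", "en"):
        with open(f"{split}.{lang}.tok", encoding="utf-8") as fin, \
             open(f"{split}.{lang}.bpe", "w", encoding="utf-8") as fout:
            for line in fin:
                fout.write(bpe.process_line(line))
```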
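
The Hardware Specification row reports the average decoding time of a single sentence on a 12GB Titan X. The sketch below shows how such a per-sentence latency measurement is typically taken with PyTorch on a GPU; `model.generate` is a hypothetical stand-in for whichever decoding routine is being timed, and the warm-up count is an assumption.

```python
# Hedged sketch of per-sentence (batch size 1) decoding latency measurement.
import time
import torch

def average_decode_time(model, sentences, warmup=5):
    model.eval()
    timings = []
    with torch.no_grad():
        for i, src in enumerate(sentences):
            if torch.cuda.is_available():
                torch.cuda.synchronize()          # flush pending GPU work
            start = time.perf_counter()
            _ = model.generate(src)               # placeholder decoding call
            if torch.cuda.is_available():
                torch.cuda.synchronize()          # wait for decoding to finish
            if i >= warmup:                       # skip warm-up iterations
                timings.append(time.perf_counter() - start)
    return sum(timings) / max(len(timings), 1)
```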
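
For reference, the hyperparameters quoted in the Experiment Setup row are collected below as plain Python dictionaries. The key names are descriptive labels chosen for this sketch rather than FAIRSEQ flag names; only the values come from the excerpt.

```python
# Model configurations quoted in the Experiment Setup row; M is the Markov
# order used for the attention barriers.
MARKOV_TRANSFORMER_CONFIGS = {
    "iwslt14_de_en": dict(layers=6, attention_heads=4,
                          model_dim=512, hidden_dim=1024, M=4),
    "wmt14_en_de__de_en": dict(layers=6, attention_heads=8,
                               model_dim=512, hidden_dim=2048, M=4),
    "wmt16_en_ro__ro_en": dict(layers=6, attention_heads=8,
                               model_dim=512, hidden_dim=2048, M=4),
}
```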