Cascaded Text Generation with Markov Transformers
Authors: Yuntian Deng, Alexander Rush
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on five machine translation datasets compare this approach to other beam search and non-autoregressive baselines. Our inference approach is comparably fast to non-autoregressive methods while allowing for local dependencies in a principled, probabilistic way. Results validate the competitive accuracy/speed tradeoff of our approach compared to existing methods. |
| Researcher Affiliation | Academia | Yuntian Deng, Harvard University, dengyuntian@seas.harvard.edu; Alexander M. Rush, Cornell University, arush@cornell.edu |
| Pseudocode | Yes | Algorithm 1: Parallel Cascaded Decoding (a simplified sketch of the pruning step appears below the table). |
| Open Source Code | Yes | The code for reproducing all results is available at https://github.com/harvardnlp/cascaded-generation. |
| Open Datasets | Yes | We evaluate our approach on five commonly used machine translation benchmark datasets: IWSLT14 De-En [6] (160k parallel sentences), WMT14 En-De/De-En [29] (4M parallel sentences) and WMT16 En-Ro/Ro-En [3] (610k parallel sentences). |
| Dataset Splits | Yes | We sample all validation datasets to be at most 3k. To process the data, we use Byte Pair Encoding (BPE) [46, 23] learned on the training set with a shared vocabulary between source and target. (A minimal BPE sketch appears below the table.) |
| Hardware Specification | Yes | We measure the average decoding time of a single sentence [13, 25, 16, 15, 55, 51] on a 12GB Nvidia Titan X GPU. (A timing sketch appears below the table.) |
| Software Dependencies | No | The paper mentions using FAIRSEQ [34] and PyTorch [35] but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Model Settings: The Markov transformer uses the same hyperparameters as standard transformers. The base settings are from FAIRSEQ [34]: for IWSLT14 De-En, we use 6 layers, 4 attention heads, model dimension 512, hidden dimension 1024; for WMT14 En-De/De-En and WMT16 En-Ro/Ro-En we use 6 layers, 8 attention heads, model dimension 512, hidden dimension 2048. It differs only in the application of attention barriers, where we set M = 4. The optimization settings can be found in the supplementary materials. (These settings are collected as a config sketch below the table.) |
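
The Pseudocode row points to Algorithm 1 (Parallel Cascaded Decoding). As a rough illustration of the pruning step that algorithm repeats over increasing n-gram orders, here is a minimal, sequential NumPy sketch: it computes max-marginals on a first-order chain with a max-plus forward-backward pass and keeps the top-K tokens per position. This is a simplification under stated assumptions, not the authors' implementation; the paper's version operates over n-gram states and runs the dynamic program in parallel on GPU, and the names `max_marginals` and `prune_topk` are illustrative only.

```python
# Simplified sketch of one pruning round in cascaded decoding (assumption:
# a first-order chain with dense unary/pairwise scores; the paper's Algorithm 1
# applies the analogous computation to n-gram states, in parallel on GPU).
import numpy as np

def max_marginals(unary, pairwise):
    """unary: (T, V) per-position scores; pairwise: (V, V) transition scores.
    Returns (T, V) max-marginals: the best total score of any sequence that
    passes through token v at position t (max-plus forward-backward)."""
    T, V = unary.shape
    alpha = np.zeros((T, V))
    beta = np.zeros((T, V))
    alpha[0] = unary[0]
    for t in range(1, T):                      # forward max-plus pass
        alpha[t] = unary[t] + np.max(alpha[t - 1][:, None] + pairwise, axis=0)
    for t in range(T - 2, -1, -1):             # backward max-plus pass
        beta[t] = np.max(pairwise + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    return alpha + beta

def prune_topk(max_marg, k):
    """Keep the k highest-scoring tokens per position (the cascade's pruning step)."""
    return np.argsort(-max_marg, axis=1)[:, :k]   # (T, k) surviving token ids

# Toy usage: 5 positions, vocabulary of 20, keep 4 candidates per position.
rng = np.random.default_rng(0)
unary = rng.normal(size=(5, 20))
pairwise = rng.normal(size=(20, 20))
print(prune_topk(max_marginals(unary, pairwise), k=4))
```

In the full cascade, the surviving candidates at each position are expanded to higher-order n-grams and re-scored by the Markov transformer before the next pruning round.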
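
For the BPE preprocessing quoted in the Dataset Splits row, the sketch below learns joint BPE codes on the concatenated source and target training text so both sides share one subword vocabulary. It assumes subword-nmt's Python API (the tool behind FAIRSEQ's standard preparation scripts); file names and the merge count are placeholders, and the actual values are set in the repository's preparation scripts.

```python
# Minimal sketch of shared-vocabulary BPE learning (assumption: subword-nmt's
# Python API; "train.de"/"train.en" and NUM_MERGES are placeholders).
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

NUM_MERGES = 10000   # assumption: the real merge count is in the repo's prep scripts

# Learn merges on the concatenation of source and target training text.
with open("train.de") as src, open("train.en") as tgt, open("codes.bpe", "w") as codes:
    learn_bpe(list(src) + list(tgt), codes, NUM_MERGES)

# Apply the shared codes to one side (repeat for the other side and splits).
with open("codes.bpe") as codes:
    bpe = BPE(codes)
with open("train.de") as f, open("train.bpe.de", "w") as out:
    for line in f:
        out.write(bpe.process_line(line))
```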
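
The Hardware row describes measuring the average decoding time of a single sentence on a GPU. A minimal sketch of such a measurement follows, assuming PyTorch; `decode_one` is a hypothetical stand-in for the generation routine, and the authors' actual timing harness is in their repository.

```python
# Sketch of per-sentence latency measurement (assumption: `decode_one` is a
# placeholder for the single-sentence generation call being timed).
import time
import torch

def average_decoding_time(decode_one, sentences):
    """Average wall-clock seconds to decode one sentence (batch size 1)."""
    torch.cuda.synchronize()        # make sure prior GPU work has finished
    start = time.time()
    for src in sentences:
        decode_one(src)             # hypothetical single-sentence decode
        torch.cuda.synchronize()    # wait for the GPU before reading the clock
    return (time.time() - start) / len(sentences)
```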
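
The Experiment Setup row lists per-dataset hyperparameters; for reference, they are collected below as a plain Python config. The dictionary keys and the `MARKOV_ORDER` name are labels introduced here, not FAIRSEQ options; how M = 4 is passed to the authors' code is not specified in this table.

```python
# The quoted model hyperparameters, collected as a reference config
# (illustrative names; the attention-barrier order M is set in the authors' code).
CONFIGS = {
    "iwslt14_de_en": dict(layers=6, attention_heads=4,
                          model_dim=512, hidden_dim=1024),
    "wmt14_en_de":   dict(layers=6, attention_heads=8,
                          model_dim=512, hidden_dim=2048),
    "wmt14_de_en":   dict(layers=6, attention_heads=8,
                          model_dim=512, hidden_dim=2048),
    "wmt16_en_ro":   dict(layers=6, attention_heads=8,
                          model_dim=512, hidden_dim=2048),
    "wmt16_ro_en":   dict(layers=6, attention_heads=8,
                          model_dim=512, hidden_dim=2048),
}
MARKOV_ORDER = 4   # M, the attention-barrier order reported for all datasets
```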