Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Cascaded Text Generation with Markov Transformers
Authors: Yuntian Deng, Alexander Rush
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on five machine translation datasets compare this approach to other beam search and nonautoregressive baselines. Our inference approach is comparably fast to non-autoregressive methods while allowing for local dependencies in a principled, probabilistic way. Results validate the competitive accuracy/speed tradeoff of our approach compared to existing methods. |
| Researcher Affiliation | Academia | Yuntian Deng, Harvard University; Alexander M. Rush, Cornell University |
| Pseudocode | Yes | Algorithm 1 Parallel Cascaded Decoding |
| Open Source Code | Yes | The code for reproducing all results is available at https://github.com/harvardnlp/cascaded-generation. |
| Open Datasets | Yes | We evaluate our approach on five commonly used machine translation benchmark datasets: IWSLT14 De-En [6] (~160k parallel sentences), WMT14 En-De/De-En [29] (~4M parallel sentences) and WMT16 En-Ro/Ro-En [3] (~610k parallel sentences). |
| Dataset Splits | Yes | We sample all validation datasets to be at most 3k. To process the data, we use Byte Pair Encoding (BPE) [46, 23] learned on the training set with a shared vocabulary between source and target. |
| Hardware Specification | Yes | We measure the average decoding time of a single sentence [13, 25, 16, 15, 55, 51] on a 12GB Nvidia Titan X GPU. |
| Software Dependencies | No | The paper mentions using FAIRSEQ [34] and PyTorch [35] but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Model Settings: Markov transformer uses the same hyperparameters as standard transformers. The base settings are from FAIRSEQ [34]: For IWSLT14 De-En, we use 6 layers, 4 attention heads, model dimension 512, hidden dimension 1024; for WMT14 En-De/De-En and WMT16 En-Ro/Ro-En we use 6 layers, 8 attention heads, model dimension 512, hidden dimension 2048. It differs only in the application of attention barriers, where we set M = 4. The optimization settings can be found at supplementary materials. |
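
The model settings quoted in the Experiment Setup row can be collected into a small lookup table for anyone attempting a rerun. The following is a minimal sketch, not the authors' configuration format: the class name `MarkovTransformerConfig`, the field names, the dictionary keys, and the `head_dim` helper are all hypothetical, and the per-head dimension assumes standard multi-head attention where the model dimension is split evenly across heads. Optimization settings are not included because the quoted text defers them to the supplementary materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkovTransformerConfig:
    """Hypothetical container for the settings quoted in the Experiment Setup row."""
    layers: int
    attention_heads: int
    model_dim: int
    hidden_dim: int
    barrier_m: int = 4  # attention-barrier setting, M = 4 in the quoted setup


# Keys are illustrative labels, not FAIRSEQ architecture or task names.
_SMALL = MarkovTransformerConfig(layers=6, attention_heads=4, model_dim=512, hidden_dim=1024)
_BASE = MarkovTransformerConfig(layers=6, attention_heads=8, model_dim=512, hidden_dim=2048)

SETTINGS = {
    "iwslt14_de_en": _SMALL,
    "wmt14_en_de": _BASE,
    "wmt14_de_en": _BASE,
    "wmt16_en_ro": _BASE,
    "wmt16_ro_en": _BASE,
}


def head_dim(cfg: MarkovTransformerConfig) -> int:
    """Per-head dimension, assuming the model dimension is split evenly across heads."""
    if cfg.model_dim % cfg.attention_heads != 0:
        raise ValueError("model_dim must be divisible by attention_heads")
    return cfg.model_dim // cfg.attention_heads


if __name__ == "__main__":
    for name, cfg in SETTINGS.items():
        print(name, cfg, "head_dim =", head_dim(cfg))
```

Keeping the five language pairs as separate keys mirrors the five benchmark datasets listed in the Open Datasets row, even though the four WMT directions share one set of values.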