Cascaded Text Generation with Markov Transformers

Authors: Yuntian Deng, Alexander Rush

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on five machine translation datasets compare this approach to other beam search and non-autoregressive baselines. Our inference approach is comparably fast to non-autoregressive methods while allowing for local dependencies in a principled, probabilistic way. Results validate the competitive accuracy/speed tradeoff of our approach compared to existing methods.
Researcher Affiliation | Academia | Yuntian Deng (Harvard University, dengyuntian@seas.harvard.edu); Alexander M. Rush (Cornell University, arush@cornell.edu)
Pseudocode | Yes | Algorithm 1: Parallel Cascaded Decoding (a simplified coarse-to-fine sketch follows the table)
Open Source Code | Yes | The code for reproducing all results is available at https://github.com/harvardnlp/cascaded-generation.
Open Datasets | Yes | We evaluate our approach on five commonly used machine translation benchmark datasets: IWSLT14 De-En [6] (160k parallel sentences), WMT14 En-De/De-En [29] (4M parallel sentences), and WMT16 En-Ro/Ro-En [3] (610k parallel sentences).
Dataset Splits | Yes | We sample all validation datasets to be at most 3k. To process the data, we use Byte Pair Encoding (BPE) [46, 23] learned on the training set with a shared vocabulary between source and target. (A preprocessing sketch follows the table.)
Hardware Specification | Yes | We measure the average decoding time of a single sentence [13, 25, 16, 15, 55, 51] on a 12GB Nvidia Titan X GPU. (A latency-measurement sketch follows the table.)
Software Dependencies | No | The paper mentions using FAIRSEQ [34] and PyTorch [35] but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Model settings: the Markov transformer uses the same hyperparameters as standard transformers. The base settings are from FAIRSEQ [34]: for IWSLT14 De-En, we use 6 layers, 4 attention heads, model dimension 512, hidden dimension 1024; for WMT14 En-De/De-En and WMT16 En-Ro/Ro-En we use 6 layers, 8 attention heads, model dimension 512, hidden dimension 2048. It differs only in the application of attention barriers, where we set M = 4. The optimization settings can be found in the supplementary materials. (The quoted configurations are collected in a sketch after the table.)
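
The Pseudocode row names Algorithm 1 (Parallel Cascaded Decoding) but does not reproduce it. The snippet below is only a simplified, two-stage coarse-to-fine illustration of the underlying idea: a low-order model prunes the candidates at each position, and a higher-order model is then decoded over the surviving lattice. It is not the paper's parallel algorithm; the tensor shapes and the top-k pruning rule are assumptions made for this sketch.

```python
# Simplified coarse-to-fine decoding sketch (NOT the paper's Algorithm 1).
import torch

def cascaded_decode_sketch(unigram_logp, bigram_logp, topk=4):
    """unigram_logp: (T, V) per-position token log-probabilities (order 0).
    bigram_logp: (T-1, V, V) transition scores between adjacent positions.
    Returns a list of T decoded token ids."""
    T, V = unigram_logp.shape
    # Stage 1 (order 0): keep the top-k candidates at every position in parallel.
    cand_scores, cand_ids = unigram_logp.topk(topk, dim=-1)        # (T, k)
    # Stage 2 (order 1): Viterbi over the pruned lattice with bigram scores.
    back = []
    score = cand_scores[0]                                         # (k,)
    for t in range(1, T):
        # Transition scores restricted to surviving candidates: (k, k).
        trans = bigram_logp[t - 1][cand_ids[t - 1]][:, cand_ids[t]]
        total = score.unsqueeze(1) + trans + cand_scores[t].unsqueeze(0)
        score, prev = total.max(dim=0)                             # (k,), (k,)
        back.append(prev)
    # Backtrace the best path through the pruned candidate sets.
    best = int(score.argmax())
    path = [best]
    for prev in reversed(back):
        best = int(prev[best])
        path.append(best)
    path.reverse()
    return [int(cand_ids[t, j]) for t, j in enumerate(path)]

# Toy usage with random scores.
torch.manual_seed(0)
T, V = 5, 20
print(cascaded_decode_sketch(torch.randn(T, V), torch.randn(T - 1, V, V)))
```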
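
The Dataset Splits row notes that BPE is learned on the training set with a vocabulary shared between source and target. Below is a minimal sketch of that step, assuming the subword-nmt package; the file names and merge count are illustrative, since the excerpt does not specify the exact tooling or number of merges.

```python
# Hedged BPE preprocessing sketch: joint codes learned on concatenated
# source+target training text (shared vocabulary), then applied to all splits.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Learn BPE merges on the joined training text (hypothetical file names).
with open("train.joined.tok", encoding="utf-8") as train, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(train, codes_out, num_symbols=10000)

# 2. Apply the shared codes to both languages of every split.
with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)

for split in ("train", "valid", "test"):
    for lang in ("de", "en"):
        with open(f"{split}.{lang}.tok", encoding="utf-8") as fin, \
             open(f"{split}.{lang}.bpe", "w", encoding="utf-8") as fout:
            for line in fin:
                fout.write(bpe.process_line(line))
```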
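
The Hardware Specification row reports the average decoding time of a single sentence on a 12GB Titan X. The sketch below shows how such a per-sentence latency measurement is typically taken with PyTorch on a GPU; `model.generate` is a hypothetical stand-in for whichever decoding routine is being timed, and the warm-up count is an assumption.

```python
# Hedged sketch of per-sentence (batch size 1) decoding latency measurement.
import time
import torch

def average_decode_time(model, sentences, warmup=5):
    model.eval()
    timings = []
    with torch.no_grad():
        for i, src in enumerate(sentences):
            if torch.cuda.is_available():
                torch.cuda.synchronize()          # flush pending GPU work
            start = time.perf_counter()
            _ = model.generate(src)               # placeholder decoding call
            if torch.cuda.is_available():
                torch.cuda.synchronize()          # wait for decoding to finish
            if i >= warmup:                       # skip warm-up iterations
                timings.append(time.perf_counter() - start)
    return sum(timings) / max(len(timings), 1)
```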
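
For reference, the hyperparameters quoted in the Experiment Setup row are collected below as plain Python dictionaries. The key names are descriptive labels chosen for this sketch rather than FAIRSEQ flag names; only the values come from the excerpt.

```python
# Model configurations quoted in the Experiment Setup row; M is the Markov
# order used for the attention barriers.
MARKOV_TRANSFORMER_CONFIGS = {
    "iwslt14_de_en": dict(layers=6, attention_heads=4,
                          model_dim=512, hidden_dim=1024, M=4),
    "wmt14_en_de__de_en": dict(layers=6, attention_heads=8,
                               model_dim=512, hidden_dim=2048, M=4),
    "wmt16_en_ro__ro_en": dict(layers=6, attention_heads=8,
                               model_dim=512, hidden_dim=2048, M=4),
}
```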