PARABANK: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-Constrained Neural Machine Translation
Authors: J. Edward Hu, Rachel Rudinger, Matt Post, Benjamin Van Durme (pp. 6521-6528)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present PARABANK, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of PARANMT (Wieting and Gimpel, 2018), we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. ... Using human judgments, we also demonstrate that PARABANK's paraphrases improve over PARANMT on both semantic similarity and fluency. |
| Researcher Affiliation | Academia | J. Edward Hu, Rachel Rudinger, Matt Post, Benjamin Van Durme, 3400 North Charles Street, Johns Hopkins University, Baltimore, MD, USA |
| Pseudocode | No | The paper describes the methods in narrative text and does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Using PARABANK, we train, and release to the public, a monolingual sentence re-writing system, which may be used to paraphrase unseen English sentences with lexical constraints. ... PARABANK is available for download at: http://nlp.jhu.edu/parabank. ... In addition to releasing hundreds of millions of English sentential paraphrases, we also release a free, pre-trained, model for monolingual sentential rewriting, as trained on PARABANK. |
| Open Datasets | Yes | The training data, CzEng 1.7 (Bojar et al., 2016)... We apply the same pipeline to the 10⁹-word French-English parallel corpus (Giga) (Callison-Burch et al., 2009). |
| Dataset Splits | No | The paper discusses data sampling for human evaluation ('We randomly sampled 100 Czech-English sentence pairs from each of the four English token lengths...') but does not provide explicit details about the train/validation/test splits used for training the neural machine translation model itself from the CzEng 1.7 or Giga corpora. |
| Hardware Specification | Yes | We trained the model on 2 Nvidia GTX 1080Ti for two weeks. |
| Software Dependencies | No | The paper cites software tools such as Sockeye (Hieber et al., 2017), spaCy (Honnibal and Montani, 2017), and MorphoDiTa (Straková, Straka, and Hajíč, 2014), but does not give the specific version numbers of these dependencies that would be needed for reproducibility. |
| Experiment Setup | Yes | The model's encoder and decoder are both 6-layer LSTMs with a hidden size of 1024 and an embedding size of 512. Additionally, the model has one dot-attention layer. |
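
To make the reported architecture concrete, here is a minimal PyTorch sketch of an encoder-decoder with the dimensions quoted in the Experiment Setup row (6-layer LSTM encoder and decoder, hidden size 1024, embedding size 512, a single dot-product attention layer). This is an illustrative reconstruction, not the authors' Sockeye configuration; the vocabulary sizes and class names are placeholders.

```python
# Illustrative sketch of the reported encoder-decoder dimensions.
# Not the authors' Sockeye model; vocab sizes are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqSketch(nn.Module):
    def __init__(self, src_vocab=50000, tgt_vocab=50000,
                 emb_size=512, hidden_size=1024, num_layers=6):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_size)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_size)
        self.encoder = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True)
        self.decoder = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True)
        self.out = nn.Linear(2 * hidden_size, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        enc_out, enc_state = self.encoder(self.src_emb(src_ids))     # (B, S, H)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), enc_state)  # (B, T, H)
        # Single dot-product attention over the encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))         # (B, T, S)
        context = torch.bmm(F.softmax(scores, dim=-1), enc_out)      # (B, T, H)
        return self.out(torch.cat([dec_out, context], dim=-1))       # (B, T, V)
```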
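The Open Source Code row notes that the released rewriter paraphrases unseen English sentences under lexical constraints. As a rough illustration of how such an input might be prepared for Sockeye's constrained decoding, the snippet below builds one JSON-formatted line; the `constraints` and `avoid` field names are assumptions about Sockeye's `--json-input` format and may differ across versions, and the example sentence and constraint words are invented.

```python
# Hedged sketch: one JSON input line for lexically-constrained decoding.
# Field names ("constraints" = must appear, "avoid" = must not appear) are
# assumptions about Sockeye's JSON input format, not confirmed by the paper.
import json

record = {
    "text": "he went to the store yesterday .",
    "constraints": ["purchased"],   # phrase the paraphrase must contain
    "avoid": ["went"],              # source word to keep out of the output
}
print(json.dumps(record))  # one JSON object per line, piped to the translator
```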