Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Controlling Neural Machine Translation Formality with Synthetic Supervision
Authors: Xing Niu, Marine Carpuat | Pages: 8568-8575
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We design experiments to evaluate the impact of our approaches to (1) formality control, and (2) synthetic supervision. We conduct a comprehensive automatic and human evaluation of the resulting FSMT systems. |
| Researcher Affiliation | Collaboration | Xing Niu¹, Marine Carpuat²; ¹Amazon AWS AI, ²University of Maryland |
| Pseudocode | No | The paper describes its algorithms and processes in textual paragraphs and through diagrams, but it does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Source code: https://github.com/xingniu/multitask-ft-fsmt. |
| Open Datasets | Yes | We use the GYAFC corpus introduced by Rao and Tetreault (2018) in all tasks. We train MT systems on the concatenation of large diverse parallel corpora: (1) Europarl.v7 (Koehn 2005); (2) News-Commentary.v14 (Bojar et al. 2018); (3) OpenSubtitles2016 (Lison and Tiedemann 2016). |
| Dataset Splits | Yes | The train split consists of 105K informal-formal sentence pairs whereas the dev/test sets consist of roughly 10K/5K pairs for both formality transfer directions, i.e., I→F and F→I. The learning rate for baseline models is initialized to 0.001 and reduced by 30% after 4 checkpoints without improvement of perplexity on the development set. |
| Hardware Specification | No | The paper describes the model architecture and training process (e.g., 'bidirectional encoder with a single LSTM layer'), but it does not specify any hardware components such as GPU models, CPU types, or cloud computing instances used for the experiments. |
| Software Dependencies | No | The paper mentions software like 'Sockeye toolkit' and 'Adam optimizer,' but it does not provide specific version numbers for these or other software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | Our translation model uses a bidirectional encoder with a single LSTM layer of size 512, multilayer perceptron attention with a layer size of 512, and word representations of size 512. We apply layer normalization [...], add dropout to embeddings and RNNs [...] with probability 0.2, and tie the source and target embeddings [...]. We train using the Adam optimizer [...] with a batch size of 64 sentences and we checkpoint the model every 1000 updates. The learning rate for baseline models is initialized to 0.001 and reduced by 30% after 4 checkpoints without improvement of perplexity on the development set. Training stops after 10 checkpoints without improvement. [...] inheriting all settings except the learning rate which is re-initialized to 0.0001. The hyperparameter α in Equation 5 is set to 0.05. |
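
The learning-rate and stopping rules quoted in the Experiment Setup row can be sketched as a small plateau schedule. This is an illustrative reading of the quoted text, not the authors' code or the Sockeye implementation: the class name `PlateauSchedule`, its field names, the interpretation of "reduced by 30%" as multiplying by 0.7, and the way the counters reset are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PlateauSchedule:
    """Hypothetical helper mirroring the quoted schedule for baseline models."""

    lr: float = 0.001            # initial learning rate (from the quoted setup)
    reduce_factor: float = 0.7   # assumption: "reduced by 30%" means lr * 0.7
    reduce_patience: int = 4     # checkpoints without improvement before reducing
    stop_patience: int = 10      # checkpoints without improvement before stopping
    best: float = float("inf")   # best development-set perplexity seen so far
    since_best: int = 0          # checkpoints since the best perplexity
    since_reduce: int = 0        # non-improving checkpoints since the last reduction

    def step(self, dev_perplexity: float) -> bool:
        """Record one checkpoint; return True if training should continue."""
        if dev_perplexity < self.best:
            # Improvement: remember it and reset both counters.
            self.best = dev_perplexity
            self.since_best = 0
            self.since_reduce = 0
            return True
        self.since_best += 1
        self.since_reduce += 1
        if self.since_best >= self.stop_patience:
            return False  # early stopping
        if self.since_reduce >= self.reduce_patience:
            # Assumption: the reduction counter restarts after each reduction.
            self.lr *= self.reduce_factor
            self.since_reduce = 0
        return True
```

In a training loop, `step` would be called once per checkpoint (every 1000 updates in the quoted setup) with the current development perplexity. The fine-tuning stage described in the same row would reuse the schedule with `lr=0.0001`.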