Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MLE-Guided Parameter Search for Task Loss Minimization in Neural Sequence Modeling
Authors: Sean Welleck, Kyunghyun Cho | Pages: 14032-14040
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation. 5 Experiments 5.1 Text Completion with GPT-2 Table 1: Text completion results (GPT-2, Wikitext-103 test set) |
| Researcher Affiliation | Academia | Sean Welleck, Kyunghyun Cho New York University Correspondence to: EMAIL. |
| Pseudocode | Yes | Algorithm 1: MLE-guided parameter search (MGS). |
| Open Source Code | Yes | Code available at https://github.com/wellecks/mgs. |
| Open Datasets | Yes | We use the Wikitext-103 dataset (Merity et al. 2016). We experiment on the IWSLT 14 German to English task (Cettolo et al. 2014). |
| Dataset Splits | Yes | The resulting dataset consists of 874,556 training, 1,896 validation, and 2,162 test pairs. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'fairseq' but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We use 4 candidates, and compute training task loss with a max decoding length of 1.3 times the ground-truth length. Models are evaluated with a max decoding length of 500 tokens. We performed a grid search using α ∈ {0.1, 0.3, 0.5}, selecting α based on the validation task loss that the model is optimizing. We use 4 candidates and a grid search over noise ({0.01, 0.1, 1.0}) and α ({1.0, 10.0, 100.0}). For fine-tuning, we use a batch size of 16k tokens, and accumulate gradients for 4 iterations. |
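
The search settings quoted in the Experiment Setup row can be written out as a small configuration sketch. This is an illustration only, not the authors' code: the names `GRID_A_ALPHAS`, `GRID_B_NOISES`, `GRID_B_ALPHAS`, `grid_b`, `max_train_decode_len`, and `select_best` are hypothetical, the row does not state which experiment each grid belongs to, and rounding the 1.3x length limit up is an assumption.

```python
import math
from itertools import product

# Values quoted in the Experiment Setup row. The labels "A" and "B" are
# placeholders: the row does not say which experiment each grid belongs to.
GRID_A_ALPHAS = [0.1, 0.3, 0.5]
GRID_B_NOISES = [0.01, 0.1, 1.0]
GRID_B_ALPHAS = [1.0, 10.0, 100.0]
NUM_CANDIDATES = 4          # "We use 4 candidates"
EVAL_MAX_DECODE_LEN = 500   # "max decoding length of 500 tokens" at evaluation


def grid_b():
    """Enumerate every (noise, alpha) pair of the second quoted grid."""
    return [
        {"noise": noise, "alpha": alpha, "num_candidates": NUM_CANDIDATES}
        for noise, alpha in product(GRID_B_NOISES, GRID_B_ALPHAS)
    ]


def max_train_decode_len(ground_truth_len, ratio=1.3):
    """Training-time decoding limit: 1.3 times the ground-truth length.

    Rounding up to a whole token count is an assumption; the row does not
    say how fractional lengths are handled.
    """
    return math.ceil(ratio * ground_truth_len)


def select_best(configs, validation_task_loss):
    """Pick the configuration with the lowest validation task loss.

    `validation_task_loss` is a caller-supplied function (hypothetical here)
    that trains and evaluates a model for one configuration.
    """
    return min(configs, key=validation_task_loss)
```

In use, `validation_task_loss` would wrap a full training and validation run; a toy stand-in such as `lambda c: abs(c["alpha"] - 10.0) + c["noise"]` is enough to exercise the selection logic over the nine configurations returned by `grid_b()`.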