MLE-Guided Parameter Search for Task Loss Minimization in Neural Sequence Modeling
Authors: Sean Welleck, Kyunghyun Cho
AAAI 2021, pp. 14032–14040
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation." (Section 5, Experiments; Table 1: Text completion results, GPT-2, Wikitext-103 test set) |
| Researcher Affiliation | Academia | Sean Welleck, Kyunghyun Cho New York University Correspondence to: wellecks@nyu.edu. |
| Pseudocode | Yes | Algorithm 1: MLE-guided parameter search (MGS). |
| Open Source Code | Yes | Code available at https://github.com/wellecks/mgs. |
| Open Datasets | Yes | "We use the Wikitext-103 dataset (Merity et al. 2016)"; "We experiment on the IWSLT'14 German-to-English task (Cettolo et al. 2014)" |
| Dataset Splits | Yes | The resulting dataset consists of 874,556 training, 1,896 validation, and 2,162 test pairs. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'fairseq' but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | "We use 4 candidates, and compute training task loss with a max decoding length of 1.3 times the ground-truth length. Models are evaluated with a max decoding length of 500 tokens. We performed a grid search using α ∈ {0.1, 0.3, 0.5}, selecting α based on the validation task loss that the model is optimizing. We use 4 candidates and a grid search over noise ({0.01, 0.1, 1.0}) and α ({1.0, 10.0, 100.0}). For fine-tuning, we use a batch size of 16k tokens, and accumulate gradients for 4 iterations." |
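The pseudocode row above refers to Algorithm 1 (MLE-guided parameter search) in the paper. A minimal sketch of the idea, as a toy NumPy example: sample a few candidate updates as Gaussian perturbations of the MLE gradient, score each candidate's parameters with the sequence-level task loss, and apply an exponentially reweighted combination of the candidates. The function names, hyperparameter defaults, and the quadratic toy loss here are illustrative assumptions, not the paper's implementation (see the authors' code at https://github.com/wellecks/mgs for the real thing).

```python
import numpy as np

rng = np.random.default_rng(0)

def mgs_step(theta, mle_grad, task_loss,
             n_candidates=4, noise=0.1, alpha=1.0, lr=0.1):
    """One MLE-guided search step (sketch, not the paper's exact update):
    sample candidate update directions around the MLE gradient, evaluate
    the task loss at each candidate's resulting parameters, and combine
    the candidates with exponential weights favoring low task loss."""
    # Candidate directions: MLE gradient plus Gaussian noise.
    grads = [mle_grad + noise * rng.standard_normal(theta.shape)
             for _ in range(n_candidates)]
    # Score each candidate by the task loss at its hypothetical parameters.
    losses = np.array([task_loss(theta - lr * g) for g in grads])
    # Exponential reweighting; subtract the min for numerical stability.
    weights = np.exp(-alpha * (losses - losses.min()))
    weights /= weights.sum()
    # Apply the weighted combination of candidate directions.
    update = sum(w * g for w, g in zip(weights, grads))
    return theta - lr * update

# Toy usage: minimize a quadratic "task loss"; its gradient (2*theta)
# stands in for the MLE gradient, which the real method gets from
# backpropagation through the language model.
theta = np.array([3.0, -2.0])
task_loss = lambda t: float(np.sum(t ** 2))
for _ in range(50):
    theta = mgs_step(theta, mle_grad=2 * theta, task_loss=task_loss)
```

In the paper the task loss is a sequence-level quantity (e.g. repetition or non-termination rate) computed on decoded continuations, which is why the setup row above caps decoding length during training and evaluation.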