MLE-Guided Parameter Search for Task Loss Minimization in Neural Sequence Modeling

Authors: Sean Welleck, Kyunghyun Cho

AAAI 2021, pp. 14032-14040 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation." (Sec. 5 Experiments, 5.1 Text Completion with GPT-2; Table 1: Text completion results, GPT-2, Wikitext-103 test set)
Researcher Affiliation | Academia | "Sean Welleck, Kyunghyun Cho, New York University. Correspondence to: wellecks@nyu.edu."
Pseudocode | Yes | "Algorithm 1: MLE-guided parameter search (MGS)." (An illustrative sketch of one MGS-style update follows this table.)
Open Source Code | Yes | "Code available at https://github.com/wellecks/mgs."
Open Datasets | Yes | "We use the Wikitext-103 dataset (Merity et al. 2016)." "We experiment on the IWSLT 14 German to English task (Cettolo et al. 2014)."
Dataset Splits | Yes | "The resulting dataset consists of 874,556 training, 1,896 validation, and 2,162 test pairs."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions 'fairseq' but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | "We use 4 candidates, and compute training task loss with a max decoding length of 1.3 times the ground-truth length. Models are evaluated with a max decoding length of 500 tokens. We performed a grid search using α ∈ {0.1, 0.3, 0.5}, selecting α based on the validation task loss that the model is optimizing." "We use 4 candidates and a grid search over noise ({0.01, 0.1, 1.0}) and α ({1.0, 10.0, 100.0}). For fine-tuning, we use a batch size of 16k tokens, and accumulate gradients for 4 iterations." (A minimal grid-search driver follows this table.)
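
For readers who do not have Algorithm 1 at hand, the following is a rough, framework-free sketch of one MGS-style update. It follows the paper's high-level description of sampling candidate update directions from a mixture of random search around the current parameters and around the MLE gradient, then combining candidates weighted by their task-loss improvement. The softmax weighting, the use of α as a weighting temperature, and the even split between the two mixture components are illustrative assumptions, not the paper's Algorithm 1.

```python
import numpy as np

def mgs_step(theta, mle_grad, task_loss_fn, num_candidates=4, noise=0.1,
             alpha=1.0, rng=np.random.default_rng(0)):
    """One MGS-style update on a flat parameter vector (illustrative sketch)."""
    base_loss = task_loss_fn(theta)
    directions, improvements = [], []
    for k in range(num_candidates):
        # Half the candidates explore around the current parameters,
        # the other half around the MLE descent direction (assumed split).
        center = -mle_grad if k % 2 == 0 else np.zeros_like(theta)
        delta = center + noise * rng.standard_normal(theta.shape)
        directions.append(delta)
        improvements.append(base_loss - task_loss_fn(theta + delta))
    # Combine candidates with weights from a softmax over their task-loss
    # improvement; treating alpha as the softmax temperature is an assumption.
    weights = np.exp(alpha * np.asarray(improvements))
    weights /= weights.sum()
    return theta + sum(w * d for w, d in zip(weights, directions))
```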
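The Experiment Setup row quotes a small hyperparameter grid for the machine translation runs. A minimal driver for that search might look as follows; `train_mgs_model` and `validation_task_loss` are hypothetical toy stubs standing in for the training and evaluation code in the released repository, while the candidate count and the grids over noise and α are taken from the quoted setup.

```python
import itertools

# Hypothetical placeholders: stand-ins for the fine-tuning and evaluation
# code at https://github.com/wellecks/mgs, NOT functions from that codebase.
def train_mgs_model(num_candidates, noise, alpha):
    """Fine-tune a model with MGS under the given hyperparameters (toy stub)."""
    return {"num_candidates": num_candidates, "noise": noise, "alpha": alpha}

def validation_task_loss(model):
    """Sequence-level task loss on the validation set (toy stub)."""
    return abs(model["noise"] - 0.1) + abs(model["alpha"] - 10.0)

# Grid from the quoted setup: 4 candidates, noise in {0.01, 0.1, 1.0},
# alpha in {1.0, 10.0, 100.0}; the configuration is selected by the
# validation value of the task loss being optimized.
best_loss, best_cfg = float("inf"), None
for noise, alpha in itertools.product([0.01, 0.1, 1.0], [1.0, 10.0, 100.0]):
    model = train_mgs_model(num_candidates=4, noise=noise, alpha=alpha)
    loss = validation_task_loss(model)
    if loss < best_loss:
        best_loss, best_cfg = loss, (noise, alpha)

print(f"selected noise={best_cfg[0]}, alpha={best_cfg[1]} "
      f"(validation task loss {best_loss:.3f})")
```

The same select-by-validation-task-loss pattern applies to the text completion runs, where only α is searched over {0.1, 0.3, 0.5}.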