Uncertainty Estimation in Autoregressive Structured Prediction

Authors: Andrey Malinin, Mark Gales

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work also provides baselines for token-level and sequence-level error detection, and sequence-level out-of-domain input detection on the WMT 14 English-French and WMT 17 English-German translation and LibriSpeech speech recognition datasets. (A minimal sequence-level scoring sketch follows the table.)
Researcher Affiliation | Collaboration | Andrey Malinin (Yandex; Higher School of Economics) am969@yandex-team.ru; Mark Gales (ALTA Institute, University of Cambridge) mjfg@eng.cam.ac.uk
Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block.
Open Source Code | No | The paper states that 'Standard Fairseq (Ott et al., 2019) implementations of all models are used' but does not provide open-source code for the specific methodology developed in the paper.
Open Datasets | Yes | Models were trained on the full 960 hours of the LibriSpeech dataset (Panayotov et al., 2015) in exactly the same configuration as described in (Mohamed et al., 2019). NMT models were trained on the WMT 14 English-French and WMT 17 English-German datasets.
Dataset Splits | Yes | LibriSpeech dev splits: Dev-Clean (5.4 hours, 2703 utterances, 17.8) and Dev-Other (5.3 hours, 2864 utterances, 18.9). For NMT, Newstest14 was treated as in-domain data, with details provided in Appendix B.2 about using the 'standard Fairseq... recipe, which is consistent with the baseline setup described in (Ott et al., 2018b)', implying standard splits.
Hardware Specification | Yes | Training took 8 days using 8 V100 GPUs.
Software Dependencies | No | The paper mentions 'Standard Fairseq (Ott et al., 2019) implementations of all models are used', but does not specify a version number for Fairseq or any other software dependencies.
Experiment Setup | Yes | ASR models were trained at a fixed learning rate for 80 epochs, where an epoch is a full pass through the entire training set; checkpoints over the last 30 epochs were averaged together. Beam-width for NMT and ASR models is 5 and 20, respectively. Models trained on WMT 17 English-German were trained for 193,000 steps of gradient descent (roughly 49 epochs), while WMT 14 English-French models were trained for 800,000 steps (roughly 19 epochs); these NMT models were checkpoint-averaged across the last 10 epochs. All models were trained using mixed-precision training. (A checkpoint-averaging sketch follows the table.)
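
The checkpoint averaging noted in the Experiment Setup row can be reproduced roughly along the lines of the minimal PyTorch sketch below, which averages the parameter tensors of the last N saved checkpoints. It assumes each checkpoint file stores a plain model state dict; Fairseq checkpoints wrap the state dict in extra metadata, and the authors presumably used Fairseq's own averaging script, so this is an illustration rather than their exact procedure. The file names in the usage example are hypothetical.

import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several saved checkpoints.

    Assumes each file in `paths` is a torch-saved state dict
    (parameter name -> tensor) with identical keys and shapes.
    Returns a new state dict whose tensors are the element-wise mean
    (accumulated in float32; cast back to fp16 if needed for a
    mixed-precision model).
    """
    if not paths:
        raise ValueError("need at least one checkpoint to average")
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    n = len(paths)
    return {k: v / n for k, v in avg_state.items()}

# Example: average the last 10 epoch checkpoints (hypothetical file names).
if __name__ == "__main__":
    ckpts = [f"checkpoint{epoch}.pt" for epoch in range(40, 50)]
    averaged = average_checkpoints(ckpts)
    torch.save(averaged, "checkpoint_avg_last10.pt")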
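
As background for the token- and sequence-level error-detection baselines mentioned in the Research Type row, the sketch below computes two common single-model confidence scores from the per-token log-probabilities of a decoded hypothesis: a length-normalized sequence log-likelihood and per-token negative log-probabilities. This is a generic baseline illustration, not the paper's ensemble-based uncertainty estimators, and all names and the example numbers are placeholders.

from typing import List

def sequence_confidence(token_logprobs: List[float]) -> float:
    """Length-normalized log-likelihood of a decoded hypothesis.

    Higher values mean the model is more confident in the sequence;
    low values can be used to flag likely translation/transcription
    errors or out-of-domain inputs (a common sequence-level baseline).
    """
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)

def token_error_scores(token_logprobs: List[float]) -> List[float]:
    """Per-token error scores: negative log-probability of each token.

    Tokens with high scores (low probability) are candidates for
    token-level error detection.
    """
    return [-lp for lp in token_logprobs]

# Example usage with made-up log-probabilities for a 4-token hypothesis.
if __name__ == "__main__":
    logprobs = [-0.1, -0.3, -2.5, -0.2]
    print(f"sequence confidence: {sequence_confidence(logprobs):.3f}")
    print("token error scores:", [round(s, 2) for s in token_error_scores(logprobs)])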