Uncertainty Estimation in Autoregressive Structured Prediction
Authors: Andrey Malinin, Mark Gales
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The work provides baselines for token-level and sequence-level error detection, and for sequence-level out-of-domain input detection, on the WMT 14 English-French and WMT 17 English-German translation datasets and the LibriSpeech speech recognition dataset. |
| Researcher Affiliation | Collaboration | Andrey Malinin Yandex, Higher School of Economics am969@yandex-team.ru Mark Gales ALTA Institute, University of Cambridge mjfg@eng.cam.ac.uk |
| Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper states that 'Standard Fairseq (Ott et al., 2019) implementations of all models are used' but does not provide open-source code for the specific methodology developed in the paper. |
| Open Datasets | Yes | Models were trained on the full 960 hours of the LibriSpeech dataset (Panayotov et al., 2015) in exactly the same configuration as described in (Mohamed et al., 2019). NMT models were trained on the WMT 14 English-French and WMT 17 English-German datasets. |
| Dataset Splits | Yes | LibriSpeech: statistics are reported for the Dev-Clean (5.4, 2703, 17.8) and Dev-Other (5.3, 2864, 18.9) splits. For NMT, Newstest14 was treated as in-domain data; Appendix B.2 notes the use of the 'standard Fairseq... recipe, which is consistent with the baseline setup described in (Ott et al., 2018b)', implying standard splits. |
| Hardware Specification | Yes | Training took 8 days using 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions 'Standard Fairseq (Ott et al., 2019) implementations of all models are used', but does not specify a version number for Fairseq or any other software dependencies. |
| Experiment Setup | Yes | Specifically, models were trained at a fixed learning rate for 80 epochs, where an epoch is a full pass through the entire training set. Checkpoints over the last 30 epochs were averaged together. Beam-width for NMT and ASR models is 5 and 20, respectively. Models trained on WMT 17 English-German were trained for 193000 steps of gradient descent, which corresponds to roughly 49 epochs, while WMT 14 English-French models were trained for 800000 steps of gradient descent, which corresponds to roughly 19 epochs. Models were checkpoint-averaged across the last 10 epochs. All models were trained using mixed-precision training. |
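
The checkpoint averaging mentioned in the "Experiment Setup" row (last 30 epochs for ASR, last 10 epochs for NMT) is a standard post-training step. The sketch below shows one way it could be reproduced for plain PyTorch state dicts; the directory layout, filename pattern, and `average_checkpoints` helper are illustrative assumptions, not taken from the paper or its (unreleased) training scripts, and Fairseq's own `scripts/average_checkpoints.py` would normally be used instead.

```python
from pathlib import Path

import torch


def average_checkpoints(ckpt_dir: str, last_n: int = 10) -> dict:
    """Element-wise average of the parameters in the last `last_n` checkpoints."""
    # Assumes checkpoint filenames sort chronologically (e.g. zero-padded
    # epoch numbers such as checkpoint_010.pt); adjust the sort key otherwise.
    paths = sorted(Path(ckpt_dir).glob("checkpoint*.pt"))[-last_n:]
    if not paths:
        raise FileNotFoundError(f"no checkpoints found in {ckpt_dir}")

    avg_state: dict = {}
    for path in paths:
        # Assumes each file holds a plain state dict; real Fairseq checkpoints
        # nest the weights under a 'model' key, so unwrap them first in that case.
        state = torch.load(path, map_location="cpu")
        for name, tensor in state.items():
            if name in avg_state:
                avg_state[name] += tensor.float()
            else:
                avg_state[name] = tensor.clone().float()

    # Divide the accumulated sums by the number of checkpoints averaged.
    for name in avg_state:
        avg_state[name] /= len(paths)
    return avg_state


if __name__ == "__main__":
    # Average the last 10 epoch checkpoints, as reported for the NMT models.
    torch.save(average_checkpoints("checkpoints/", last_n=10), "checkpoint_avg.pt")
```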