Uncertainty Estimation in Autoregressive Structured Prediction
Authors: Andrey Malinin, Mark Gales
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The work provides baselines for token-level and sequence-level error detection, and for sequence-level out-of-domain input detection, on the WMT 14 English-French and WMT 17 English-German translation datasets and the LibriSpeech speech recognition dataset. |
| Researcher Affiliation | Collaboration | Andrey Malinin Yandex, Higher School of Economics am969@yandex-team.ru Mark Gales ALTA Institute, University of Cambridge mjfg@eng.cam.ac.uk |
| Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper states that 'Standard Fairseq (Ott et al., 2019) implementations of all models are used' but does not provide open-source code for the specific methodology developed in the paper. |
| Open Datasets | Yes | Models were trained on the full 960 hours of the LibriSpeech dataset (Panayotov et al., 2015) in exactly the same configuration as described in (Mohamed et al., 2019). NMT models were trained on the WMT 14 English-French and WMT 17 English-German datasets. |
| Dataset Splits | Yes | LibriSpeech: statistics are reported for the Dev-Clean (5.4, 2703, 17.8) and Dev-Other (5.3, 2864, 18.9) splits. For NMT, Newstest14 was treated as in-domain data; Appendix B.2 notes the use of the 'standard Fairseq... recipe, which is consistent with the baseline setup described in (Ott et al., 2018b)', implying standard splits. |
| Hardware Specification | Yes | Training took 8 days using 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions 'Standard Fairseq (Ott et al., 2019) implementations of all models are used', but does not specify a version number for Fairseq or any other software dependencies. |
| Experiment Setup | Yes | Specifically, models were trained at a fixed learning rate for 80 epochs, where an epoch is a full pass through the entire training set. Checkpoints over the last 30 epochs were averaged together. Beam-width for NMT and ASR models is 5 and 20, respectively. Models trained on WMT 17 English-German were trained for 193000 steps of gradient descent, which corresponds to roughly 49 epochs, while WMT 14 English-French models were trained for 800000 steps of gradient descent, which corresponds to roughly 19 epochs. Models were checkpoint-averaged across the last 10 epochs. All models were trained using mixed-precision training. |
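
The checkpoint averaging mentioned in the "Experiment Setup" row (last 30 epochs for ASR, last 10 epochs for NMT) is a standard post-training step. The sketch below shows one way it could be reproduced for plain PyTorch state dicts; the directory layout, filename pattern, and `average_checkpoints` helper are illustrative assumptions, not taken from the paper or its (unreleased) training scripts, and Fairseq's own `scripts/average_checkpoints.py` would normally be used instead.

```python
from pathlib import Path

import torch


def average_checkpoints(ckpt_dir: str, last_n: int = 10) -> dict:
    """Element-wise average of the parameters in the last `last_n` checkpoints."""
    # Assumes checkpoint filenames sort chronologically (e.g. zero-padded
    # epoch numbers such as checkpoint_010.pt); adjust the sort key otherwise.
    paths = sorted(Path(ckpt_dir).glob("checkpoint*.pt"))[-last_n:]
    if not paths:
        raise FileNotFoundError(f"no checkpoints found in {ckpt_dir}")

    avg_state: dict = {}
    for path in paths:
        # Assumes each file holds a plain state dict; real Fairseq checkpoints
        # nest the weights under a 'model' key, so unwrap them first in that case.
        state = torch.load(path, map_location="cpu")
        for name, tensor in state.items():
            if name in avg_state:
                avg_state[name] += tensor.float()
            else:
                avg_state[name] = tensor.clone().float()

    # Divide the accumulated sums by the number of checkpoints averaged.
    for name in avg_state:
        avg_state[name] /= len(paths)
    return avg_state


if __name__ == "__main__":
    # Average the last 10 epoch checkpoints, as reported for the NMT models.
    torch.save(average_checkpoints("checkpoints/", last_n=10), "checkpoint_avg.pt")
```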