Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Robustness to Missing Video for Audiovisual Speech Recognition

Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites.
Researcher Affiliation | Industry | Oscar Chang EMAIL Google LLC, USA; Otavio Braga EMAIL Google LLC, USA; Hank Liao EMAIL Google LLC, USA; Dmitriy Serdyuk EMAIL Google LLC, USA; Olivier Siohan EMAIL Google LLC, USA
Pseudocode | No | The paper describes the methods (Dropout, Cascades) in prose and uses diagrams (Figure 1) to illustrate model architectures, but it does not contain explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | In the Supplementary materials, we have attached an example script written in tensorflow code to show how our robustness test suites can be generated.
Open Datasets | Yes | Training Data: We closely adhere to the process outlined by Makino et al. (2019) to create a large-scale dataset containing 100,000 hours of AV data from public YouTube videos. This is done by mining segments of videos where the force-aligned user-uploaded transcript matches a production-quality ASR system with high confidence. Then, SyncNet (Chung & Zisserman, 2016) is applied as a filtering step to ensure that the video track also matches the audio track with high confidence. [...] from the NoiseX corpus (Varga & Steeneken, 1993). [...] F Experiments on TED LRS3
Dataset Splits | Yes | Training Data: We closely adhere to the process outlined by Makino et al. (2019) to create a large-scale dataset containing 100,000 hours of AV data from public YouTube videos. [...] Test Data: As discussed in Section 4.3, robustness should be treated not as a universal property of a model, but as an empirical claim about its performance on a particular test set. We benchmark our models on a separate test set of YouTube videos that contains 550 hours of professionally transcribed audiovisual clips (27,353 utterances and 342,507 words), under varying amounts of artificially added babble noise (clean, 20 dB, 10 dB, 0 dB) from the NoiseX corpus (Varga & Steeneken, 1993).
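The noise conditions quoted above (clean, 20 dB, 10 dB, 0 dB) are signal-to-noise ratios between the speech and the added babble noise. As a minimal sketch of how a noise track can be mixed at a target SNR (the `mix_at_snr` helper and its NumPy implementation are illustrative assumptions, not the paper's supplementary script):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the speech-to-noise power
    ratio of the mixture equals `snr_db` decibels."""
    noise = noise[: len(speech)]  # trim noise to the utterance length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve 10*log10(speech_power / (gain^2 * noise_power)) == snr_db for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Lower SNR values correspond to heavier corruption, so a sweep such as `[20.0, 10.0, 0.0]` reproduces the shape of the benchmark conditions described above.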
Hardware Specification | No | The paper mentions training models and performing experiments but does not specify any hardware details such as GPU models, CPU types, or other computing infrastructure used.
Software Dependencies | No | In the Supplementary materials, we have attached an example script written in tensorflow code to show how our robustness test suites can be generated. (This indicates TensorFlow, but no version is provided.)
Experiment Setup | Yes | Optimization: All our models are trained in exactly the same way: Adam with β1 = 0.9, β2 = 0.97, batch size 4096, and learning rate 0.001 for a total of 500k steps with linear warmup in the first 10k steps and an exponential decay to the smaller learning rate of 0.0001 from steps 300k to 400k. Two-Pass uses the same optimization setup for both training passes.
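The quoted schedule can be sketched as a piecewise function of the training step. This is one plausible reading of the description (treating "exponential decay from steps 300k to 400k" as geometric interpolation between the two learning rates is an assumption; the function name and defaults are illustrative):

```python
def learning_rate(step: int,
                  base_lr: float = 1e-3,
                  final_lr: float = 1e-4,
                  warmup_steps: int = 10_000,
                  decay_start: int = 300_000,
                  decay_end: int = 400_000) -> float:
    """Piecewise schedule: linear warmup, constant plateau,
    exponential decay to final_lr, then constant until 500k steps."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr over the first 10k steps.
        return base_lr * step / warmup_steps
    if step < decay_start:
        return base_lr
    if step < decay_end:
        # Geometric interpolation between base_lr and final_lr.
        frac = (step - decay_start) / (decay_end - decay_start)
        return base_lr * (final_lr / base_lr) ** frac
    return final_lr
```

An Adam optimizer with β1 = 0.9, β2 = 0.97 and batch size 4096, as quoted above, would then consume this per-step learning rate.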