Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Robustness to Missing Video for Audiovisual Speech Recognition

Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites.
Researcher Affiliation | Industry | Oscar Chang EMAIL Google LLC, USA; Otavio Braga EMAIL Google LLC, USA; Hank Liao EMAIL Google LLC, USA; Dmitriy Serdyuk EMAIL Google LLC, USA; Olivier Siohan EMAIL Google LLC, USA
Pseudocode | No | The paper describes the methods (Dropout, Cascades) in prose and uses diagrams (Figure 1) to illustrate model architectures, but it does not contain explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | In the Supplementary materials, we have attached an example script written in tensorflow code to show how our robustness test suites can be generated.
Open Datasets | Yes | Training Data: We closely adhere to the process outlined by Makino et al. (2019) to create a large-scale dataset containing 100,000 hours of AV data from public YouTube videos. This is done by mining segments of videos where the force-aligned user-uploaded transcript matches a production-quality ASR system with high confidence. Then, SyncNet (Chung & Zisserman, 2016) is applied as a filtering step to ensure that the video track also matches the audio track with high confidence. [...] from the NoiseX corpus (Varga & Steeneken, 1993). [...] F Experiments on TED LRS3
Dataset Splits | Yes | Training Data: We closely adhere to the process outlined by Makino et al. (2019) to create a large-scale dataset containing 100,000 hours of AV data from public YouTube videos. [...] Test Data: As discussed in Section 4.3, robustness should be treated not as a universal property of a model, but as an empirical claim about its performance on a particular test set. We benchmark our models on a separate test set of YouTube videos that contains 550 hours of professionally transcribed audiovisual clips (27,353 utterances and 342,507 words), under varying amounts of artificially added babble noise (clean, 20 dB, 10 dB, 0 dB) from the NoiseX corpus (Varga & Steeneken, 1993).
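The noise conditions quoted above (clean, 20 dB, 10 dB, 0 dB) are signal-to-noise ratios between the speech and the added babble noise. As a minimal sketch of how a noise track can be mixed at a target SNR (the `mix_at_snr` helper and its NumPy implementation are illustrative assumptions, not the paper's supplementary script):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the speech-to-noise power
    ratio of the mixture equals `snr_db` decibels."""
    noise = noise[: len(speech)]  # trim noise to the utterance length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve 10*log10(speech_power / (gain^2 * noise_power)) == snr_db for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Lower SNR values correspond to heavier corruption, so a sweep such as `[20.0, 10.0, 0.0]` reproduces the shape of the benchmark conditions described above.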
Hardware Specification | No | The paper mentions training models and performing experiments but does not specify any hardware details such as GPU models, CPU types, or other computing infrastructure used.
Software Dependencies | No | In the Supplementary materials, we have attached an example script written in tensorflow code to show how our robustness test suites can be generated. (This indicates TensorFlow, but no version is provided.)
Experiment Setup | Yes | Optimization: All our models are trained in exactly the same way: Adam with β1 = 0.9, β2 = 0.97, batch size 4096, and learning rate 0.001 for a total of 500k steps with linear warmup in the first 10k steps and an exponential decay to the smaller learning rate of 0.0001 from steps 300k to 400k. Two-Pass uses the same optimization setup for both training passes.
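The quoted schedule can be sketched as a piecewise function of the training step. This is one plausible reading of the description (treating "exponential decay from steps 300k to 400k" as geometric interpolation between the two learning rates is an assumption; the function name and defaults are illustrative):

```python
def learning_rate(step: int,
                  base_lr: float = 1e-3,
                  final_lr: float = 1e-4,
                  warmup_steps: int = 10_000,
                  decay_start: int = 300_000,
                  decay_end: int = 400_000) -> float:
    """Piecewise schedule: linear warmup, constant plateau,
    exponential decay to final_lr, then constant until 500k steps."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr over the first 10k steps.
        return base_lr * step / warmup_steps
    if step < decay_start:
        return base_lr
    if step < decay_end:
        # Geometric interpolation between base_lr and final_lr.
        frac = (step - decay_start) / (decay_end - decay_start)
        return base_lr * (final_lr / base_lr) ** frac
    return final_lr
```

An Adam optimizer with β1 = 0.9, β2 = 0.97 and batch size 4096, as quoted above, would then consume this per-step learning rate.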