Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On Robustness to Missing Video for Audiovisual Speech Recognition
Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. |
| Researcher Affiliation | Industry | Oscar Chang (Google LLC, USA); Otavio Braga (Google LLC, USA); Hank Liao (Google LLC, USA); Dmitriy Serdyuk (Google LLC, USA); Olivier Siohan (Google LLC, USA) |
| Pseudocode | No | The paper describes the methods (Dropout, Cascades) in prose and uses diagrams (Figure 1) to illustrate model architectures, but it does not contain explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | In the Supplementary materials, we have attached an example script written in tensorflow code to show how our robustness test suites can be generated. |
| Open Datasets | Yes | Training Data We closely adhere to the process outlined by Makino et al. (2019) to create a large-scale dataset containing 100,000 hours of AV data from public YouTube videos. This is done by mining segments of videos where the force-aligned user-uploaded transcript matches a production-quality ASR system with high confidence. Then, SyncNet (Chung & Zisserman, 2016) is applied as a filtering step to ensure that the video track also matches the audio track with high confidence. [...] from the NOISEX corpus (Varga & Steeneken, 1993). [...] F Experiments on LRS3-TED |
| Dataset Splits | Yes | Training Data We closely adhere to the process outlined by Makino et al. (2019) to create a large-scale dataset containing 100,000 hours of AV data from public YouTube videos. [...] Test Data As discussed in Section 4.3, robustness should be treated not as a universal property of a model, but as an empirical claim about its performance on a particular test set. We benchmark our models on a separate test set of YouTube videos that contains 550 hours of professionally transcribed audiovisual clips (27,353 utterances and 342,507 words), under varying amounts of artificially added babble noise (clean, 20 dB, 10 dB, 0 dB) from the NOISEX corpus (Varga & Steeneken, 1993). |
| Hardware Specification | No | The paper mentions training models and performing experiments but does not specify any hardware details such as GPU models, CPU types, or other computing infrastructure used. |
| Software Dependencies | No | In the Supplementary materials, we have attached an example script written in tensorflow code to show how our robustness test suites can be generated. (This indicates TensorFlow, but no version is provided). |
| Experiment Setup | Yes | Optimization All our models are trained in exactly the same way: Adam with β1 = 0.9, β2 = 0.97, batch size 4096, and learning rate 0.001 for a total of 500k steps with linear warmup in the first 10k steps and an exponential decay to the smaller learning rate of 0.0001 from steps 300k to 400k. Two-Pass uses the same optimization setup for both training passes. |
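The test conditions quoted above mix babble noise into the audio at fixed signal-to-noise ratios (clean, 20 dB, 10 dB, 0 dB). The paper's supplementary material attaches a TensorFlow script for generating such test suites; the script itself is not reproduced here, but the underlying scaling math can be sketched in plain NumPy. The function name and signature below are illustrative, not the paper's:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the resulting SNR equals `snr_db`.

    Assumes both inputs are float arrays of the same length and sample rate.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Target noise power from the definition SNR_dB = 10 * log10(P_speech / P_noise).
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise
```

Looping this over `snr_db` in `(20.0, 10.0, 0.0)` (plus the unmodified clean audio) reproduces the four test conditions listed in the Dataset Splits row.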
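The Experiment Setup row specifies a piecewise learning-rate schedule: linear warmup to 0.001 over the first 10k steps, a constant phase until step 300k, a decay to 0.0001 between steps 300k and 400k, and a constant tail until 500k. A minimal sketch of that schedule follows; the exact decay curve is an interpretation (the paper says only "exponential decay ... from steps 300k to 400k", modeled here as exponential interpolation between the two rates):

```python
PEAK_LR = 1e-3       # learning rate after warmup
FINAL_LR = 1e-4      # learning rate after decay
WARMUP_STEPS = 10_000
DECAY_START = 300_000
DECAY_END = 400_000  # training continues at FINAL_LR until step 500k

def learning_rate(step: int) -> float:
    """Learning rate at `step`: linear warmup, flat, exponential decay, flat."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step < DECAY_START:
        return PEAK_LR
    if step < DECAY_END:
        # Exponential interpolation from PEAK_LR down to FINAL_LR.
        frac = (step - DECAY_START) / (DECAY_END - DECAY_START)
        return PEAK_LR * (FINAL_LR / PEAK_LR) ** frac
    return FINAL_LR
```

Plugging this into Adam with β1 = 0.9, β2 = 0.97 and batch size 4096 would match the quoted configuration; the Two-Pass model reuses the same schedule for both training passes.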