Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning
Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions. Furthermore, we demonstrate the better generality of MSRL system than other baselines when test set contains unseen noises. |
| Researcher Affiliation | Academia | Chen Chen1, Yuchen Hu1, Qiang Zhang2, 3, Heqing Zou1, Beier Zhu1, and Eng Siong Chng1 1School of Computer Science and Engineering, Nanyang Technological University 2ZJU-Hangzhou Global Scientific and Technological Innovation Center 3College of Computer Science and Technology, Zhejiang University |
| Pseudocode | Yes | Algorithm 1: Pseudocode for MSRL Training |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the described methodology. |
| Open Datasets | Yes | We conduct the experiments on LRS3 (Afouras, Chung, and Zisserman 2018b), which is the largest publicly available dataset for audio-visual speech recognition task. It includes face tracks from over 400 hours of TED and TEDx videos from more than 5,000 speakers, along with the corresponding subtitles and word alignment boundaries. |
| Dataset Splits | Yes | The original training set is divided into 2 partitions: pretrain (403 hours) and trainval (30 hours), which are both from the same sources with test set (1452 utterances, 1 hour). In this paper, we randomly choose 1,200 utterances (1 hour) as a valid set for hyper-parameter tuning and best model selection. |
| Hardware Specification | No | The computational work for this article was (fully/partially) performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). This mentions a computing resource but does not provide specific hardware details like GPU models or CPU types. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | MSRL Setup We develop several MSRL frameworks with different settings, as shown in Table 1. The small transformer block has 768/3072/12 of embedding dimension/feed-forward dimension/attention heads, and the large transformer block increases to 2034/4096/16 respectively. ... The normal-resource contains 433 hours of full training data (pretrain subset and trainval subset), and the low-resource only contains 30 hours of training data (trainval subset). |
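The table notes that the paper includes pseudocode ("Algorithm 1: Pseudocode for MSRL Training") but does not reproduce it. As a generic illustration of the reinforcement-learning ingredient such methods typically build on, a WER-based reward for a REINFORCE-style update, here is a minimal sketch. All function names are illustrative assumptions, not the paper's actual algorithm:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)


def reinforce_reward(ref: str, sampled_hyp: str, baseline_hyp: str) -> float:
    """Self-critical-style reward: WER reduction of a sampled hypothesis
    over a greedy baseline. Positive when the sample transcribes better."""
    return word_error_rate(ref, baseline_hyp) - word_error_rate(ref, sampled_hyp)
```

In a REINFORCE-style trainer this reward would scale the log-probability gradient of the sampled transcript; the greedy baseline reduces gradient variance. This is a common pattern for sequence-level ASR training, not a reconstruction of the paper's Algorithm 1.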