MRD-Net: Multi-Modal Residual Knowledge Distillation for Spoken Question Answering

Authors: Chenyu You, Nuo Chen, Yuexian Zou

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that the proposed MRD-Net achieves superior results compared with state-of-the-art methods on three spoken question answering benchmark datasets."
Researcher Affiliation | Collaboration | Chenyu You (Department of Electrical Engineering, Yale University, USA); Nuo Chen (ADSPLAB, School of ECE, Peking University, Shenzhen, China); Yuexian Zou (ADSPLAB, School of ECE, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing open-source code, nor a link to a code repository.
Open Datasets | Yes | Spoken-SQuAD [Li et al., 2018] is an English listening comprehension dataset... The 2018 Formosa Grand Challenge (FGC) dataset (https://fgc.stpi.narl.org.tw/activity/techai2018) is a Mandarin Chinese spoken multi-choice question answering (MCQA) dataset... Spoken-CoQA [You et al., 2020a] is an English spoken conversational question answering (SCQA) dataset...
Dataset Splits | Yes | Spoken-SQuAD [Li et al., 2018] is an English listening comprehension dataset, which contains 37k ASR-transcript question pairs in the training set and 5.4k in the testing set. The 2018 Formosa Grand Challenge (FGC) dataset is a Mandarin Chinese spoken multi-choice question answering (MCQA) dataset, which includes 7k passage-question-choices (PQC) pairs as the training set and 1.5k as the development set. Spoken-CoQA [You et al., 2020a] is an English spoken conversational question answering (SCQA) dataset, which consists of 40k question-answer pairs from 4k conversations in the training set and 3.8k question-answer pairs from 380 conversations in the test set, drawn from seven diverse domains.
Hardware Specification | Yes | "We train our student model using 2 NVIDIA 2080Ti GPUs."
Software Dependencies | No | The paper mentions software components such as the BPE and VQ-Wav2Vec tokenizers, the AdamW optimizer, and the Kaldi toolkit, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | "The maximum sequence lengths of T and S are 512, and that of Audio-A is 1024. We use the AdamW optimizer in training, with the learning rate set to 8e-6. All models are trained with a batch size of 4. The hyperparameters τ and α are set to 1 and 0.9, respectively."
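The reported hyperparameters τ (a temperature) and α (a mixing weight) are consistent with a standard Hinton-style knowledge-distillation objective. As an illustration only, a minimal sketch of such an objective follows; the paper does not spell out its exact loss, so the function names and the precise weighting below are assumptions, not MRD-Net's actual formulation:

```python
import math

# Hyperparameter values reported in the paper's experiment setup.
TAU = 1.0    # distillation temperature τ
ALPHA = 0.9  # assumed weight on the distillation term (α); hypothetical role


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_idx,
                      tau=TAU, alpha=ALPHA):
    """Standard KD objective: alpha * tau^2 * KL(teacher || student)
    + (1 - alpha) * cross-entropy on the ground-truth label."""
    p_teacher = softmax(teacher_logits, tau)
    p_student = softmax(student_logits, tau)
    # KL divergence between softened teacher and student distributions.
    kd = tau * tau * sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    # Cross-entropy of the (unsoftened) student prediction vs. the label.
    ce = -math.log(softmax(student_logits)[target_idx])
    return alpha * kd + (1.0 - alpha) * ce
```

With α = 0.9 the objective is dominated by the teacher-matching term; when student and teacher logits agree, the KL term vanishes and only 10% of the label cross-entropy remains.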