Generating More Audios for End-to-End Spoken Language Understanding

Authors: Xuxin Cheng, Yuexian Zou

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All experiments are conducted on the monolingual SLU dataset SLURP and the multilingual SLU dataset MINDS-14. Experimental results show that our method outperforms the previous best textless end-to-end SLU models and can obtain a comparable performance with these models trained with the assistance of the corresponding transcripts.
Researcher Affiliation | Academia | Xuxin Cheng and Yuexian Zou, School of ECE, Peking University, China (chengxx@stu.pku.edu.cn, zouyx@pku.edu.cn)
Pseudocode | No | The paper describes the system and methods using natural language and diagrams (e.g., Figures 2 and 3), but it does not include any formally structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides URLs for external pre-trained models (e.g., HuBERT and EnCodec on GitHub) and datasets, but it does not include an explicit statement or a link to open-source code for the GMA-SLU method described in this paper.
Open Datasets | Yes | We conduct all the experiments on a monolingual SLU benchmark dataset SLURP [Bastianelli et al., 2020] and the multilingual SLU dataset MINDS-14 [Gerz et al., 2021]. (Footnote 3: https://github.com/pswietojanski/slurp; Footnote 4: https://huggingface.co/datasets/PolyAI/minds14) A dataset-loading sketch appears below the table.
Dataset Splits | Yes | We report the results of 4 languages, including en-US, fr-FR, pl-PL, and ko-KR, with a 30-20-50% train-dev-test split following the previous work for a fair comparison [Conneau et al., 2022]. If the loss on the dev set does not decrease for 5 epochs, the training process will early-stop to avoid overfitting. For all experiments, we choose the model that achieves the best performance on the dev set and evaluate it on the test set. A sketch of the split and early-stopping rule appears below the table.
Hardware Specification | Yes | All experiments are conducted on an Nvidia V100.
Software Dependencies | No | The paper mentions using specific models like HuBERT and optimizers like Adam, and refers to external codebases, but it does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries used for implementation.
Experiment Setup | Yes | Following previous work [Wang et al., 2021], the batch size is set to 16. We apply the Adam optimizer [Kingma and Ba, 2015] and 4k warm-up updates to optimize parameters, where the learning rate is increased from 4e-4 to 2e-3. If the loss on the dev set does not decrease for 5 epochs, the training process will early-stop to avoid overfitting. The weight α is set to 0.9. For the semantic tokens, we utilize the pre-trained quantized model where the number of clusters K is 100 to convert the audios to semantic tokens. During the data filtration stage, we set the threshold ρ to 80. A configuration sketch of this setup appears below the table.
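
The two benchmark datasets cited in the Open Datasets row are publicly available. The following is a minimal sketch of how they might be obtained, assuming the Hugging Face `datasets` library; only the `PolyAI/minds14` identifier, the SLURP GitHub URL, and the locale names come from the paper's footnotes, and all variable names here are illustrative.

```python
from datasets import load_dataset

# MINDS-14 (multilingual SLU); one config per locale, e.g. "en-US".
# Recent versions of `datasets` may additionally require trust_remote_code=True
# for this script-based dataset.
minds_en = load_dataset("PolyAI/minds14", name="en-US")

# SLURP (monolingual SLU) is released through its GitHub repository,
# https://github.com/pswietojanski/slurp, and is typically cloned and read
# from its distributed annotation files rather than loaded via `datasets`.
```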
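The 30-20-50% train-dev-test split and the 5-epoch early-stopping rule quoted in the Dataset Splits row can be expressed compactly. The sketch below is only an assumption about how such a rule could be implemented; the function names, the random seed, and the shuffling step are not from the paper.

```python
import random

def split_30_20_50(examples, seed=42):
    """Shuffle and split a list of examples into 30% train, 20% dev, 50% test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.3 * n), int(0.2 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

class EarlyStopper:
    """Signal a stop when the dev loss has not decreased for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, dev_loss):
        if dev_loss < self.best:
            self.best, self.bad_epochs = dev_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True means stop training
```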
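The hyperparameters quoted in the Experiment Setup row (Adam, batch size 16, 4k warm-up updates raising the learning rate from 4e-4 to 2e-3, α = 0.9, K = 100, ρ = 80) could be wired up roughly as follows. This is a sketch assuming PyTorch and a linear warm-up; the paper does not specify the post-warm-up schedule, the framework, or any of the variable names used here, and the model is a placeholder.

```python
import torch

BATCH_SIZE = 16         # "the batch size is set to 16"
WARMUP_UPDATES = 4_000  # "4k warm-up updates"
LR_START, LR_PEAK = 4e-4, 2e-3
ALPHA = 0.9             # loss weight α
NUM_CLUSTERS = 100      # K, clusters of the pre-trained quantized model
RHO = 80                # data-filtration threshold ρ

model = torch.nn.Linear(8, 8)  # placeholder; the GMA-SLU model is not released
optimizer = torch.optim.Adam(model.parameters(), lr=LR_START)

def warmup_lr(step):
    """Linearly raise the learning rate from LR_START to LR_PEAK over the warm-up."""
    if step >= WARMUP_UPDATES:
        return LR_PEAK  # post-warm-up behaviour is not specified in the paper
    return LR_START + (step / WARMUP_UPDATES) * (LR_PEAK - LR_START)

# Inside the training loop one would update the optimizer's learning rate:
# for group in optimizer.param_groups:
#     group["lr"] = warmup_lr(global_step)
```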