Generating More Audios for End-to-End Spoken Language Understanding
Authors: Xuxin Cheng, Yuexian Zou
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | All experiments are conducted on the monolingual SLU dataset SLURP and the multilingual SLU dataset MINDS-14. Experimental results show that our method outperforms the previous best Textless End-to-end SLU models and can obtain a comparable performance with these models trained with the assistance of the corresponding transcripts. |
| Researcher Affiliation | Academia | Xuxin Cheng and Yuexian Zou, School of ECE, Peking University, China. chengxx@stu.pku.edu.cn, zouyx@pku.edu.cn |
| Pseudocode | No | The paper describes the system and methods using natural language and diagrams (e.g., Figures 2 and 3), but it does not include any formally structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides URLs for external pre-trained models (e.g., HuBERT and EnCodec on GitHub) and datasets, but it does not include an explicit statement or a link to the open-source code for the GMA-SLU methodology described in this paper. |
| Open Datasets | Yes | We conduct all the experiments on a monolingual SLU benchmark dataset SLURP [Bastianelli et al., 2020] and the multilingual SLU dataset MINDS-14 [Gerz et al., 2021]. (Footnote 3: https://github.com/pswietojanski/slurp; Footnote 4: https://huggingface.co/datasets/PolyAI/minds14) |
| Dataset Splits | Yes | We report the results of 4 languages, including en-US, fr-FR, pl-PL, and ko-KR, with a 30-20-50% train-dev-test split following the previous work for a fair comparison [Conneau et al., 2022]. If the loss on the dev set does not decrease for 5 epochs, the training process will early-stop to avoid overfitting. For all experiments, we choose the model that achieves the best performance on the dev set and evaluate it on the test set. |
| Hardware Specification | Yes | All experiments are conducted on an Nvidia V100. |
| Software Dependencies | No | The paper mentions using specific models like HuBERT and optimizers like Adam, and refers to external codebases, but it does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries used for implementation. |
| Experiment Setup | Yes | Following previous work [Wang et al., 2021], the batch size is set to 16. We apply the Adam optimizer [Kingma and Ba, 2015], and 4k warm-up updates to optimize parameters, where the learning rate is increased from 4e-4 to 2e-3. If the loss on the dev set does not decrease for 5 epochs, the training process will early-stop to avoid overfitting. The weight α is set to 0.9. For the semantic tokens, we utilize the pre-trained quantized model where the number of clusters K is 100 to convert the audios to semantic tokens. During the data filtration stage, we set the threshold ρ to 80. (A minimal configuration sketch follows the table.) |
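
As a reading aid, the following is a minimal PyTorch sketch of the training configuration quoted in the Experiment Setup row: Adam with 4k linear warm-up updates raising the learning rate from 4e-4 to 2e-3, batch size 16, and early stopping once the dev loss has not decreased for 5 epochs. The paper does not release the GMA-SLU code, so the model, the dev-set evaluation helper, and the loop bounds below are placeholders; only the hyperparameter values come from the quoted text.

```python
import torch

# Placeholder model: the actual GMA-SLU architecture is not released.
model = torch.nn.Linear(768, 60)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)  # peak LR reported in the paper

WARMUP_UPDATES = 4_000      # "4k warm-up updates"
LR_START, LR_PEAK = 4e-4, 2e-3

def warmup_factor(step: int) -> float:
    # Linear warm-up from 4e-4 to 2e-3, then hold at the peak
    # (the post-warm-up schedule is not specified in the paper).
    if step >= WARMUP_UPDATES:
        return 1.0
    frac = step / WARMUP_UPDATES
    return (LR_START + frac * (LR_PEAK - LR_START)) / LR_PEAK

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

BATCH_SIZE = 16             # following [Wang et al., 2021]
PATIENCE_LIMIT = 5          # early-stop after 5 epochs without dev-loss improvement

def evaluate_on_dev(m: torch.nn.Module) -> float:
    # Hypothetical stand-in: in the real setup this would compute the dev-set loss.
    return 0.0

best_dev_loss, patience = float("inf"), 0
for epoch in range(100):    # maximum epoch count is not specified in the paper
    model.train()
    # ... iterate over training batches of size BATCH_SIZE, compute the loss,
    #     call loss.backward(), optimizer.step(), scheduler.step(), and
    #     optimizer.zero_grad() once per update ...

    dev_loss = evaluate_on_dev(model)
    if dev_loss < best_dev_loss:
        best_dev_loss, patience = dev_loss, 0
    else:
        patience += 1
        if patience >= PATIENCE_LIMIT:
            break           # early stopping as described in the paper
```

This sketch covers only the optimization and early-stopping settings; the weight α = 0.9, the K = 100 semantic-token clusters, and the filtration threshold ρ = 80 belong to the method itself and cannot be reconstructed from the quoted setup alone.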