Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Authors: Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines. |
| Researcher Affiliation | Collaboration | 1Data Science and AI Lab, Department of ECE, Seoul National University 2NAVER Cloud 3NAVER AI Lab 4Artificial Intelligence Institute, Seoul National University 5ASRI, INMC, ISRC, and Interdisciplinary Program in AI, Seoul National University |
| Pseudocode | No | The paper does not include a figure, block, or section explicitly labeled "Pseudocode" or "Algorithm". Figure 5 shows a fine-tuning template, not pseudocode. |
| Open Source Code | Yes | Our code and checkpoints are available at https://github.com/naverai/usdm. |
| Open Datasets | Yes | DailyTalk [70] |
| Dataset Splits | Yes | We follow the train/test split of Lee et al. [70] and preprocess the data for single-turn spoken dialog. As a result, we obtain a total of 20,117 training samples and 1,058 test samples. |
| Hardware Specification | Yes | 64 NVIDIA A100-40GB GPUs |
| Software Dependencies | No | The paper mentions several models and tools, some with links to their repositories (Table 6), but it does not provide specific version numbers for general software components like Python, PyTorch, or CUDA, which are typically required for full reproducibility. |
| Experiment Setup | Yes | batch size of 256. We use the Adam optimizer [73] with a learning rate of 10⁻⁴. |
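
The reported setup (Adam optimizer, learning rate 10⁻⁴, batch size 256) can be sketched as follows. This is a plain-Python illustration of the standard Adam update rule with those hyperparameters, not the authors' training code; the function and variable names are hypothetical.

```python
# Hyperparameters as reported in the paper.
LEARNING_RATE = 1e-4
BATCH_SIZE = 256

def adam_step(theta, grad, m, v, t, lr=LEARNING_RATE,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (Kingma & Ba, 2015)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
print(theta)  # moves steadily toward 0 at roughly lr per early step
```

With a learning rate of 10⁻⁴ the per-step movement is small, which matches the paper's large effective batch (256) spread over 64 A100 GPUs.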