Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation

Authors: Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
Researcher Affiliation | Collaboration | 1 Data Science and AI Lab, Department of ECE, Seoul National University; 2 NAVER Cloud; 3 NAVER AI Lab; 4 Artificial Intelligence Institute, Seoul National University; 5 ASRI, INMC, ISRC, and Interdisciplinary Program in AI, Seoul National University
Pseudocode | No | The paper does not include a figure, block, or section explicitly labeled "Pseudocode" or "Algorithm". Figure 5 shows a fine-tuning template, not pseudocode.
Open Source Code | Yes | Our code and checkpoints are available at https://github.com/naver-ai/usdm.
Open Datasets | Yes | DailyTalk [70]
Dataset Splits | Yes | We follow the train/test split of Lee et al. [70] and preprocess the data for single-turn spoken dialog. As a result, we obtain a total of 20,117 training samples and 1,058 test samples.
Hardware Specification | Yes | 64 NVIDIA A100-40GB GPUs
Software Dependencies | No | The paper mentions several models and tools, some with links to their repositories (Table 6), but it does not provide specific version numbers for general software components such as Python, PyTorch, or CUDA, which are typically required for full reproducibility.
Experiment Setup | Yes | batch size of 256. We use the Adam optimizer [73] with a learning rate of 10^-4.
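For readers reproducing the reported setup, the quoted learning rate of 10^-4 can be illustrated with a minimal, dependency-free sketch of a single Adam update step (Kingma & Ba). Note that beta1, beta2, and eps below are the common defaults and are an assumption; the paper excerpt above only specifies the learning rate and batch size.

```python
def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    lr=1e-4 matches the paper's reported learning rate; the remaining
    hyperparameters are the usual defaults (assumed, not stated in the paper).
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v

# One step on a toy parameter with gradient 2.0:
p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=2.0, m=m, v=v, t=1)
# After bias correction the first step moves by roughly lr, i.e. p ~ 0.9999.
```

In a typical PyTorch training loop, the equivalent configuration would be the optimizer constructed with this same learning rate over the model parameters.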