A Full-duplex Speech Dialogue Scheme Based On Large Language Models

Authors: Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, Yuanjun Xiong

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than 3 folds compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions.
Researcher Affiliation | Industry | Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, Yuanjun Xiong (MThreads AI); w8ngp1ng@gmail.com, lusongshuo97@gmail.com, tangyaohua28@gmail.com, yysijie@gmail.com, weixiaee@gmail.com, bitxiong@gmail.com
Pseudocode | No | The paper describes the system architecture and module operations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | We have uploaded our code, along with the prompts utilized for generating the data. Anyone with access to GPT-4 can recreate the same dataset as ours.
Open Datasets | Yes | To devise a training dataset emulating the working environment of the LLM in the system, we instruct GPT-4 [Achiam et al., 2023] to write a set of dialogue transcripts... More details on dataset construction are discussed in Appendix B. ... we construct a new benchmark dataset for automated evaluation... To start, we filter about 1,000 sessions of multi-turn dialogues from the ShareGPT dataset. Following that, GPT-4 is used to generate another 1,000 single and multi-turn oral dialogues... We named the combined benchmark dataset "duplex-dialogue-3k". (A hedged data-generation sketch follows the table.)
Dataset Splits | No | Based on Llama-3-8B-Instruct, we perform supervised fine-tuning [Ouyang et al., 2022] for 20 steps on this dataset. The paper mentions a training dataset and a benchmark dataset but does not explicitly define a validation split.
Hardware Specification | Yes | The fine-tuning is conducted on 8 NVIDIA A100 GPUs... All models are deployed on one single NVIDIA A100 GPU.
Software Dependencies | No | In our experiments, we use the Llama-3-8B-Instruct model as the basis standard LLM model... For non-streaming ASR models, we use the OpenAI open-source version of the Whisper [Radford et al., 2023] model... for the streaming ASR model, we use an open-source U2++ Conformer [Gulati et al., 2020, Wu et al., 2021] model... For non-streaming TTS models, we use VITS [Kim et al., 2021] as the base model and the streaming TTS model uses the XTTS-v2 model from COQUI-AI. No explicit version numbers are provided for these or other software dependencies. (A component-level usage sketch follows the table.)
Experiment Setup | Yes | Based on Llama-3-8B-Instruct, we perform supervised fine-tuning [Ouyang et al., 2022] for 20 steps on this dataset. The fine-tuning is conducted on 8 NVIDIA A100 GPUs with a batch size of 256 sequences and a learning rate of 1e-5 with the AdamW optimizer. (A hedged fine-tuning sketch follows the table.)
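
The data-generation step quoted under Open Datasets can be approximated with a few lines of client code. The snippet below is a minimal sketch, assuming the openai Python client (v1+) and a placeholder system prompt; the real prompts are the ones released with the paper's code, and the generate_dialogue helper and topic argument are illustrative only.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder instruction; the actual prompts ship with the paper's released code.
SYSTEM_PROMPT = (
    "Write the transcript of a spoken multi-turn dialogue between a user and "
    "an assistant, with natural pauses, interruptions, and backchannels."
)

def generate_dialogue(topic: str) -> str:
    """Ask GPT-4 to draft one dialogue transcript on a given topic (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_dialogue("planning a weekend hiking trip"))

Repeated over many topics, this mirrors the GPT-4 generation step the paper uses both for its training transcripts and for part of the "duplex-dialogue-3k" benchmark.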
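
The speech components listed under Software Dependencies are open-source and can be exercised individually. The following is a minimal sketch, assuming the openai-whisper and coqui-tts Python packages and placeholder files input.wav and speaker.wav; it covers only the non-streaming ASR (Whisper) and the XTTS-v2 TTS checkpoint, not the paper's integrated streaming pipeline.

import whisper           # openai-whisper package
from TTS.api import TTS  # coqui-tts package

# Non-streaming ASR: transcribe a recorded utterance with an open-source Whisper model.
asr_model = whisper.load_model("large-v2")
transcript = asr_model.transcribe("input.wav")["text"]
print("ASR output:", transcript)

# TTS: synthesize a reply with Coqui's XTTS-v2 checkpoint, cloning the voice in speaker.wav.
tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts_model.tts_to_file(
    text="Sure, I can help with that.",
    speaker_wav="speaker.wav",
    language="en",
    file_path="reply.wav",
)

Since the review notes that no version numbers are given, an exact reproduction would also require pinning these packages and the U2++ Conformer and VITS checkpoints.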
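
The fine-tuning recipe quoted under Experiment Setup maps onto standard Hugging Face tooling. Below is a minimal sketch, assuming the transformers Trainer and a pre-tokenized dataset supplied by the caller; the 20 steps, 256-sequence global batch, 1e-5 learning rate, and AdamW optimizer come from the paper, while the per-device/accumulation split, bf16 precision, and output directory are assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # base model named in the paper

def build_sft_trainer(train_dataset):
    """Return a Trainer for the reported SFT recipe. train_dataset is the
    pre-tokenized GPT-4 dialogue corpus, which is not reproduced here."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    # Reported: 20 steps, 256-sequence global batch, lr 1e-5, AdamW, 8x A100 GPUs.
    # Assumed split: 4 sequences/GPU x 8 accumulation steps x 8 GPUs = 256.
    args = TrainingArguments(
        output_dir="llama3-duplex-sft",
        max_steps=20,
        learning_rate=1e-5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        optim="adamw_torch",
        bf16=True,
        logging_steps=1,
        report_to="none",
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)

On the reported 8-GPU node this would typically be launched with torchrun or accelerate launch so that the global batch works out to 256 sequences per optimizer step.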