A Full-duplex Speech Dialogue Scheme Based On Large Language Model
Authors: Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, Yuanjun Xiong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than 3-fold compared with LLM-based half-duplex dialogue systems, while responding in less than 500 milliseconds in more than 50% of evaluated interactions. |
| Researcher Affiliation | Industry | Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, Yuanjun Xiong (MThreads AI); emails: w8ngp1ng@gmail.com, lusongshuo97@gmail.com, tangyaohua28@gmail.com, yysijie@gmail.com, weixiaee@gmail.com, bitxiong@gmail.com |
| Pseudocode | No | The paper describes the system architecture and module operations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have uploaded our code, along with the prompts utilized for generating the data. Anyone with access to GPT-4 can recreate the same dataset as ours. |
| Open Datasets | Yes | To devise a training dataset emulating the working environment of the LLM in the system, we instruct GPT-4 [Achiam et al., 2023] to write a set of dialogue transcripts... More details on dataset construction are discussed in Appendix B. ... we construct a new benchmark dataset for automated evaluation... To start, we filter about 1,000 sessions of multi-turn dialogues from the ShareGPT dataset. Following that, GPT-4 is used to generate another 1,000 single and multi-turn oral dialogues... We named the combined benchmark dataset "duplex-dialogue-3k". (A hedged sketch of GPT-4-based dialogue generation appears after the table.) |
| Dataset Splits | No | Based on Llama-3-8B-Instruct, we perform supervised fine-tuning [Ouyang et al., 2022] for 20 steps on this dataset. The paper describes a training dataset and a benchmark dataset but does not explicitly define a validation split. |
| Hardware Specification | Yes | The fine-tuning is conducted on 8 NVIDIA A100 GPUs... All models are deployed on one single NVIDIA A100 GPU. |
| Software Dependencies | No | In our experiments, we use the Llama-3-8B-Instruct model as the base standard LLM model... For non-streaming ASR models, we use the OpenAI open-source version of the Whisper [Radford et al., 2023] model... for the streaming ASR model, we use an open-source U2++ Conformer [Gulati et al., 2020, Wu et al., 2021] model... For non-streaming TTS models, we use VITS [Kim et al., 2021] as the base model and the streaming TTS model uses the XTTS-v2 model from COQUI-AI. No explicit version numbers for these or other software dependencies are provided. (A hedged component-loading sketch appears after the table.) |
| Experiment Setup | Yes | Based on Llama-3-8B-Instruct, we perform supervised fine-tuning [Ouyang et al., 2022] for 20 steps on this dataset. The fine-tuning is conducted on 8 NVIDIA A100 GPUs with a batch size of 256 sequences and a learning rate of 1e-5 with the AdamW optimizer. (A hedged fine-tuning configuration sketch appears after the table.) |
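
The paper states that its training and benchmark dialogues were written by GPT-4 and that the prompts ship with the released code. As a rough illustration only, here is a minimal sketch of prompting GPT-4 through the OpenAI Python client to draft an oral-style dialogue transcript; the prompt text and output handling below are placeholders, not the authors' actual prompts.

```python
# Hedged sketch: generating an oral-style dialogue transcript with GPT-4.
# The prompt is a placeholder; the authors' actual prompts are released with
# their code and are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write a short, colloquial spoken dialogue between a user and an "
    "assistant, including natural interruptions and back-channel responses. "
    "Label each turn with the speaker."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0.7,
)

transcript = response.choices[0].message.content
print(transcript)
```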
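
For the ASR and TTS components named in the Software Dependencies row (OpenAI's Whisper and Coqui's XTTS-v2), a minimal loading sketch is shown below. The specific checkpoints ("openai/whisper-large-v2", "tts_models/multilingual/multi-dataset/xtts_v2"), the input wav files, and the reference-speaker voice are assumptions, since the paper does not state which checkpoints or versions were used.

```python
# Hedged sketch: loading a Whisper ASR model and Coqui's XTTS-v2 TTS model.
# Checkpoint names and file paths are assumptions; the paper gives no versions.
from transformers import pipeline
from TTS.api import TTS

# Non-streaming ASR: OpenAI's open-source Whisper via the transformers pipeline.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")
user_text = asr("user_utterance.wav")["text"]  # placeholder input recording
print("ASR transcript:", user_text)

reply_text = "Sure, I can help with that."  # placeholder reply text from the LLM

# Streaming-capable TTS backbone: XTTS-v2 from Coqui TTS.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text=reply_text,
    speaker_wav="reference_speaker.wav",  # placeholder reference voice for XTTS
    language="en",
    file_path="assistant_reply.wav",
)
```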
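
The reported fine-tuning recipe (Llama-3-8B-Instruct, 20 SFT steps, batch size of 256 sequences, learning rate 1e-5, AdamW, 8 A100 GPUs) maps onto a standard Hugging Face Trainer configuration. The sketch below assumes the transformers Trainer and an 8-per-device batch with 4-step gradient accumulation to reach 256 sequences across 8 GPUs; the paper does not state which training framework or batch decomposition was actually used, and the training data itself is left as an explicit placeholder.

```python
# Hedged sketch of the reported SFT recipe using Hugging Face transformers.
# Framework choice and per-device/grad-accum split are assumptions; only the
# totals (20 steps, 256 sequences per step, lr 1e-5, AdamW) are reported.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

args = TrainingArguments(
    output_dir="llama3-duplex-sft",
    max_steps=20,                    # 20 supervised fine-tuning steps
    learning_rate=1e-5,              # reported learning rate
    optim="adamw_torch",             # AdamW optimizer
    per_device_train_batch_size=8,   # 8 GPUs x 8 seqs x 4 accum = 256 sequences/step
    gradient_accumulation_steps=4,
    bf16=True,
    logging_steps=1,
    report_to="none",
)

def build_train_dataset():
    """Placeholder for the GPT-4-written duplex dialogue transcripts described
    in the paper (construction details are in its Appendix B)."""
    raise NotImplementedError

trainer = Trainer(model=model, args=args, train_dataset=build_train_dataset())
trainer.train()
```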