A Full-duplex Speech Dialogue Scheme Based On Large Language Models

Authors: Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, Yuanjun Xiong

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than 3 folds compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions.
Researcher Affiliation | Industry | Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, Yuanjun Xiong (MThreads AI); w8ngp1ng@gmail.com, lusongshuo97@gmail.com, tangyaohua28@gmail.com, yysijie@gmail.com, weixiaee@gmail.com, bitxiong@gmail.com
Pseudocode | No | The paper describes the system architecture and module operations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | We have uploaded our code, along with the prompts utilized for generating the data. Anyone with access to GPT-4 can recreate the same dataset as ours.
Open Datasets | Yes | To devise a training dataset emulating the working environment of the LLM in the system, we instruct GPT-4 [Achiam et al., 2023] to write a set of dialogue transcripts... More details on dataset construction are discussed in Appendix B. ... we construct a new benchmark dataset for automated evaluation... To start, we filter about 1,000 sessions of multi-turn dialogues from the ShareGPT dataset. Following that, GPT-4 is used to generate another 1,000 single and multi-turn oral dialogues... We named the combined benchmark dataset "duplex-dialogue-3k". (A hedged data-generation sketch follows the table.)
Dataset Splits | No | Based on Llama-3-8B-Instruct, we perform supervised fine-tuning [Ouyang et al., 2022] for 20 steps on this dataset. The paper mentions a training dataset and a benchmark dataset but does not explicitly define a validation split.
Hardware Specification | Yes | The fine-tuning is conducted on 8 NVIDIA A100 GPUs... All models are deployed on one single NVIDIA A100 GPU.
Software Dependencies | No | In our experiments, we use the Llama-3-8B-Instruct model as the basis standard LLM model... For non-streaming ASR models, we use the OpenAI open-source version of the Whisper [Radford et al., 2023] model... for the streaming ASR model, we use an open-source U2++ Conformer [Gulati et al., 2020, Wu et al., 2021] model... For non-streaming TTS models, we use VITS [Kim et al., 2021] as the base model and the streaming TTS model uses the XTTS-v2 model from COQUI-AI. No explicit version numbers are provided for these or other software dependencies. (A component-level usage sketch follows the table.)
Experiment Setup | Yes | Based on Llama-3-8B-Instruct, we perform supervised fine-tuning [Ouyang et al., 2022] for 20 steps on this dataset. The fine-tuning is conducted on 8 NVIDIA A100 GPUs with a batch size of 256 sequences and a learning rate of 1e-5 with the AdamW optimizer. (A hedged fine-tuning sketch follows the table.)
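
The data-generation step quoted under Open Datasets can be approximated with a few lines of client code. The snippet below is a minimal sketch, assuming the openai Python client (v1+) and a placeholder system prompt; the real prompts are the ones released with the paper's code, and the generate_dialogue helper and topic argument are illustrative only.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder instruction; the actual prompts ship with the paper's released code.
SYSTEM_PROMPT = (
    "Write the transcript of a spoken multi-turn dialogue between a user and "
    "an assistant, with natural pauses, interruptions, and backchannels."
)

def generate_dialogue(topic: str) -> str:
    """Ask GPT-4 to draft one dialogue transcript on a given topic (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_dialogue("planning a weekend hiking trip"))

Repeated over many topics, this mirrors the GPT-4 generation step the paper uses both for its training transcripts and for part of the "duplex-dialogue-3k" benchmark.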
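
The speech components listed under Software Dependencies are open-source and can be exercised individually. The following is a minimal sketch, assuming the openai-whisper and coqui-tts Python packages and placeholder files input.wav and speaker.wav; it covers only the non-streaming ASR (Whisper) and the XTTS-v2 TTS checkpoint, not the paper's integrated streaming pipeline.

import whisper           # openai-whisper package
from TTS.api import TTS  # coqui-tts package

# Non-streaming ASR: transcribe a recorded utterance with an open-source Whisper model.
asr_model = whisper.load_model("large-v2")
transcript = asr_model.transcribe("input.wav")["text"]
print("ASR output:", transcript)

# TTS: synthesize a reply with Coqui's XTTS-v2 checkpoint, cloning the voice in speaker.wav.
tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts_model.tts_to_file(
    text="Sure, I can help with that.",
    speaker_wav="speaker.wav",
    language="en",
    file_path="reply.wav",
)

Since the review notes that no version numbers are given, an exact reproduction would also require pinning these packages and the U2++ Conformer and VITS checkpoints.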
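
The fine-tuning recipe quoted under Experiment Setup maps onto standard Hugging Face tooling. Below is a minimal sketch, assuming the transformers Trainer and a pre-tokenized dataset supplied by the caller; the 20 steps, 256-sequence global batch, 1e-5 learning rate, and AdamW optimizer come from the paper, while the per-device/accumulation split, bf16 precision, and output directory are assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # base model named in the paper

def build_sft_trainer(train_dataset):
    """Return a Trainer for the reported SFT recipe. train_dataset is the
    pre-tokenized GPT-4 dialogue corpus, which is not reproduced here."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    # Reported: 20 steps, 256-sequence global batch, lr 1e-5, AdamW, 8x A100 GPUs.
    # Assumed split: 4 sequences/GPU x 8 accumulation steps x 8 GPUs = 256.
    args = TrainingArguments(
        output_dir="llama3-duplex-sft",
        max_steps=20,
        learning_rate=1e-5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        optim="adamw_torch",
        bf16=True,
        logging_steps=1,
        report_to="none",
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)

On the reported 8-GPU node this would typically be launched with torchrun or accelerate launch so that the global batch works out to 256 sequences per optimizer step.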