An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems

Authors: Hitesh Tulsiani, David Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that leveraging our new framework compared to traditional training leads to relative WER reductions of close to 10% in real-world dialog systems, and up to 26% on public synthetic data. (A worked example of what these relative reductions mean follows the table.)
Researcher Affiliation | Collaboration | *Equal contribution. ¹Amazon AGI, ²UC Berkeley (work done while at Amazon). Correspondence to: Shalini Ghosh <ghoshsha@amazon.com>.
Pseudocode | Yes | An overview of the Ohm approach is given in Algorithm 1.
Open Source Code | No | The paper does not provide explicit statements or links to open-source code for the described methodology.
Open Datasets | Yes | Following Chan et al. (2024), we further evaluate our models on the Open Directed Dialogue Dataset (OD3). OD3 is a semi-synthetic dataset in which human-generated task-oriented dialogues from several popular datasets are augmented with LLM-generated conversational errors and computer-generated TTS audio. OD3 contains 620K turns of audio (approximately 1,172 hours).
Dataset Splits | Yes | We create two datasets for evaluation: (1) ALL: all transcribed utterances across all validation dialogues (60K utterances), and (2) REF: a subset of ALL containing only utterances that lead to user reformulations of the query (8.5K utterances). (A hypothetical construction sketch follows the table.)
Hardware Specification | Yes | ...across either 64 P100 GPUs (for the 200M model) or 64 A100 GPUs (for the 1B model).
Software Dependencies | No | The paper mentions software components and models such as "Conformer", "BERT", "SentencePiece", "Adam optimizer", and the "BIRCH clustering algorithm", but does not provide specific version numbers for these or for the overall software environment (e.g., Python or PyTorch versions).
Experiment Setup | Yes | Our teacher model is pre-trained on the PRETRAIN dataset for 500K iterations, using a per-GPU batch size ranging from 32 to 1 depending on the length of the sequence... We pre-train using an Adam optimizer; we linearly increase the learning rate for 5000 steps and thereafter decrease it proportionally to the inverse square root of the step... and use magnitude-based gradient clipping with a value of 10. We then fine-tune our teacher models for 150K steps, using an Adam optimizer with gradient clipping, featuring a learning rate decay schedule that starts at 1e-8, holds at 1e-5, and decays to 1e-6, with the clipping norm set to 0.3... (A sketch of these learning-rate schedules follows the table.)
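The WER improvements in the table are relative reductions, not absolute percentage-point drops. Below is a minimal sketch of the arithmetic; the baseline WER values are purely illustrative, since the summary above does not report absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative word-error-rate reduction, expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer

# Illustrative numbers only: the paper reports relative reductions, not these absolute WERs.
print(relative_wer_reduction(8.0, 7.2))   # 0.10 -> a ~10% relative reduction
print(relative_wer_reduction(10.0, 7.4))  # 0.26 -> a 26% relative reduction
```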
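Under one reading of the split description, ALL keeps every transcribed validation utterance, while REF keeps only utterances followed by a user reformulation. The sketch below is hypothetical: the excerpt does not say how reformulation turns are detected, so `is_reformulation` and `build_eval_sets` are assumed helpers, not names from the paper.

```python
from typing import Callable, Dict, List

def build_eval_sets(
    validation_dialogues: List[List[dict]],
    is_reformulation: Callable[[dict, dict], bool],  # assumed helper, not specified in the paper
) -> Dict[str, List[dict]]:
    """Assemble the ALL and REF evaluation sets described above (illustrative sketch only)."""
    all_utts: List[dict] = []  # ALL: every transcribed validation utterance
    ref_utts: List[dict] = []  # REF: utterances that lead to a user reformulation
    for dialogue in validation_dialogues:
        if not dialogue:
            continue
        for current, following in zip(dialogue, dialogue[1:]):
            all_utts.append(current)
            if is_reformulation(current, following):
                ref_utts.append(current)
        all_utts.append(dialogue[-1])  # final turn has no follow-up to compare against
    return {"ALL": all_utts, "REF": ref_utts}
```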
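The experiment setup describes two learning-rate schedules: a 5000-step linear warmup followed by inverse-square-root decay for pre-training, and a warmup/hold/decay schedule (1e-8 → 1e-5 → 1e-6) over 150K steps for fine-tuning. The sketch below follows that description; the peak pre-training learning rate and the fine-tuning phase boundaries are not given in the excerpt, so the values marked as assumptions are placeholders.

```python
import math

def pretrain_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 5000) -> float:
    """Linear warmup for `warmup_steps`, then decay proportional to 1/sqrt(step).

    The 5000-step warmup and inverse-square-root decay follow the paper's description;
    `peak_lr` is an assumed placeholder, since the excerpt does not state it.
    """
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

def finetune_lr(step: int,
                warmup_steps: int = 5000,   # assumption: phase length not stated in the excerpt
                hold_steps: int = 50000,    # assumption: phase length not stated in the excerpt
                total_steps: int = 150000) -> float:
    """Warm up from 1e-8 to 1e-5, hold at 1e-5, then decay toward 1e-6.

    The endpoint values (1e-8 / 1e-5 / 1e-6) and the 150K total steps come from the
    paper's description; the phase boundaries are assumed.
    """
    start, hold, end = 1e-8, 1e-5, 1e-6
    if step < warmup_steps:
        return start + (hold - start) * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return hold
    frac = (step - warmup_steps - hold_steps) / max(total_steps - warmup_steps - hold_steps, 1)
    return hold + (end - hold) * min(frac, 1.0)
```

In a PyTorch training loop, these values could be applied by setting each parameter group's `lr` every step, with `torch.nn.utils.clip_grad_norm_` (or `clip_grad_value_` for per-element magnitude clipping) enforcing the clipping values quoted above (10 for pre-training, 0.3 for fine-tuning).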