An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems
Authors: Hitesh Tulsiani, David Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that leveraging our new framework compared to traditional training leads to relative WER reductions of close to 10% in real-world dialog systems, and up to 26% on public synthetic data. |
| Researcher Affiliation | Collaboration | *Equal contribution. ¹Amazon AGI, ²UC Berkeley (work done while at Amazon). Correspondence to: Shalini Ghosh <ghoshsha@amazon.com>. |
| Pseudocode | Yes | An overview of the Ohm approach is given in Algorithm 1. |
| Open Source Code | No | The paper does not provide explicit statements or links to open-source code for the described methodology. |
| Open Datasets | Yes | Following Chan et al. (2024), we further evaluate our models on the open directed dialogue dataset (OD3). The OD3 dataset is a semi-synthetic dataset, where human-generated task-oriented dialogues from several popular data sets are augmented with LLM-generated conversational errors and computer-generated TTS audio. OD3 contains 620K turns of audio (approximately 1,172 hours). |
| Dataset Splits | Yes | We create two datasets for evaluation: (1) ALL: All transcribed utterances across all validation dialogues (60K utterances) and (2) REF: A subset of ALL containing only utterances that lead to user reformulations of the query (8.5K utterances). A hedged construction sketch is given after this table. |
| Hardware Specification | Yes | across either 64 P100 GPUs (for 200M model) or 64 A100 GPUs (for 1B model). |
| Software Dependencies | No | The paper mentions software components and models like "Conformer", "BERT", "SentencePiece", the "Adam optimizer", and the "BIRCH clustering algorithm", but does not provide specific version numbers for these or for the overall software environment (e.g., Python or PyTorch versions). |
| Experiment Setup | Yes | Our teacher model is pre-trained using the PRETRAIN dataset for 500K iterations, using a per-GPU batch size ranging from 32 to 1, depending on the length of the sequence... We pre-train using an Adam optimizer; we linearly increase the learning rate for 5000 steps and thereafter decrease it proportionally to the inverse square root of the step... and use magnitude-based gradient clipping with a value of 10. We then fine-tune our teacher models for 150k steps, using an Adam optimizer with gradient clipping, featuring a learning rate decay schedule that starts at 1e-8, holds at 1e-5, and decays to 1e-6, with the clipping norm set to 0.3... A sketch of these two learning-rate schedules follows the table. |
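
The Dataset Splits row describes a simple two-way evaluation split; the sketch below shows one way it could be constructed. This is an assumption-laden illustration: the data layout and the `leads_to_reformulation` predicate are hypothetical, since the excerpt reports only the resulting set sizes (roughly 60K utterances for ALL and 8.5K for REF), not how reformulations are identified.

```python
# Hedged sketch of building the ALL / REF evaluation sets from the Dataset Splits row.
# ALL collects every transcribed utterance in the validation dialogues;
# REF keeps only utterances that lead to a user reformulation of the query.
# `leads_to_reformulation` is a hypothetical predicate supplied by the caller.
from typing import Callable, Dict, List

Utterance = Dict[str, str]  # e.g. {"dialogue_id": ..., "text": ...}

def build_eval_sets(
    dialogues: List[List[Utterance]],
    leads_to_reformulation: Callable[[List[Utterance], int], bool],
) -> Dict[str, List[Utterance]]:
    all_utts: List[Utterance] = []
    ref_utts: List[Utterance] = []
    for dialogue in dialogues:
        for i, utt in enumerate(dialogue):
            all_utts.append(utt)
            # Keep the utterance in REF if a later user turn rephrases it.
            if leads_to_reformulation(dialogue, i):
                ref_utts.append(utt)
    return {"ALL": all_utts, "REF": ref_utts}
```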
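
The Experiment Setup row quotes two learning-rate schedules: linear warmup for 5000 steps followed by inverse-square-root decay for pre-training, and a fine-tuning schedule over 150k steps that ramps from 1e-8 to 1e-5, holds, and decays to 1e-6. Below is a minimal Python sketch of both schedules, assuming values not stated in the excerpt: the pre-training peak learning rate, the fine-tuning warmup/hold boundaries, and the linear shape of the ramp and decay are all placeholders for illustration.

```python
# Sketch of the two learning-rate schedules quoted in the Experiment Setup row.
# From the excerpt: 5000 warmup steps + inverse-sqrt decay (pre-training);
# 1e-8 -> 1e-5 -> 1e-6 over 150k steps (fine-tuning).
# Peak LR and phase boundaries below are assumptions, not values from the paper.

def pretrain_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 5000) -> float:
    """Linear warmup to `peak_lr`, then decay proportional to 1/sqrt(step).
    `peak_lr` is a placeholder; the excerpt does not state the peak value."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


def finetune_lr(
    step: int,
    total_steps: int = 150_000,
    start_lr: float = 1e-8,
    hold_lr: float = 1e-5,
    final_lr: float = 1e-6,
    warmup_steps: int = 5_000,   # assumed ramp length
    hold_until: int = 100_000,   # assumed end of the hold phase
) -> float:
    """Ramp 1e-8 -> 1e-5, hold, then decay to 1e-6 by `total_steps`.
    Only the three LR values and the 150k total steps come from the excerpt;
    the linear ramp/decay shape and the phase boundaries are assumptions."""
    if step < warmup_steps:
        return start_lr + (step / warmup_steps) * (hold_lr - start_lr)
    if step < hold_until:
        return hold_lr
    frac = min((step - hold_until) / max(total_steps - hold_until, 1), 1.0)
    return hold_lr + frac * (final_lr - hold_lr)


if __name__ == "__main__":
    # Spot-check a few fine-tuning checkpoints: 0 -> 1e-8, 5k -> 1e-5, 150k -> 1e-6.
    for s in (0, 5_000, 60_000, 150_000):
        print(s, finetune_lr(s))
```

The gradient clipping quoted in the same row (magnitude 10 for pre-training, norm 0.3 for fine-tuning) would be applied separately at each optimizer step, e.g. with a standard clip-by-norm utility.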