Exploring Transfer Learning For End-to-End Spoken Language Understanding

Authors: Subendhu Rongali, Beiye Liu, Liwei Cai, Konstantine Arkoudas, Chengwei Su, Wael Hamza

AAAI 2021, pp. 13754-13761

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are carried out on a combination of internal and publicly available datasets. We describe them here.
Researcher Affiliation | Collaboration | Amazon Alexa AI, New York; University of Massachusetts Amherst. srongali@cs.umass.edu, {beiyeliu, cliwei, arkoudk, chengwes, waelhamz}@amazon.com
Pseudocode | No | The paper describes the model architecture and training process in detail, but it does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper states, 'We release the audio data collected for the TOP dataset for future research' and provides a link to the dataset, but it does not explicitly state that the source code for the AT-AT model or its methodology is being released.
Open Datasets | Yes | We also compile an ASR dataset by downloading all splits of the publicly available LibriSpeech dataset (Panayotov et al. 2015), giving us 1000 hours of data. We use two public datasets: Fluent Speech (Lugosch et al. 2019) and SNIPS Audio (Saade et al. 2018) in our evaluation. We also construct a zero-shot dataset from the publicly available Facebook TOP (Gupta et al. 2018) dataset. (A LibriSpeech loading sketch follows the table.)
Dataset Splits | Yes | This dataset contains about 3M training utterances, 100k validation, and 100k testing utterances comprising 23 intents and 95 slots. There are about 23k train, 3k valid, and 4k test examples in this dataset. It contains 32k train, 4k eval, and 9k test utterances.
Hardware Specification | No | The paper describes the model architectures and features used (e.g., '80-dim LFB features', '2-layer 2D CNN', 'transformer encoder'), but it does not provide specific hardware details such as GPU/CPU models or memory used for running the experiments.
Software Dependencies | No | The paper mentions standard components like 'Adam optimizer' and 'cross entropy loss' and references tools like 'Gentle' and 'Amazon Polly', but it does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We use 80-dim LFB features to process the audio signals. ... We use a 2-layer 2D CNN with 256 final units and a transformer encoder with 12 layers, 4 heads, 256 units, and 2048 hidden units as our audio encoder. ... We use a Noam learning rate schedule with 4000 warm-up steps and an Adam optimizer with learning rate 1. We use cross entropy loss with label smoothing (ϵ = 0.1) as our loss function. During inference, we use beam search with a beam size of 4.
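
As a companion to the Open Datasets row, here is a minimal sketch of assembling all public LibriSpeech splits (roughly 1000 hours in total) using torchaudio's built-in loader. The `./data` root path and the ConcatDataset aggregation are illustrative choices, not from the paper.

```python
import torchaudio
from torch.utils.data import ConcatDataset

# All public LibriSpeech splits: ~960 h of training audio plus dev/test.
SPLITS = [
    "train-clean-100", "train-clean-360", "train-other-500",
    "dev-clean", "dev-other",
    "test-clean", "test-other",
]

# "./data" is an illustrative download root.
datasets = [
    torchaudio.datasets.LIBRISPEECH("./data", url=split, download=True)
    for split in SPLITS
]
librispeech_all = ConcatDataset(datasets)  # ~1000 hours combined

# Each item: (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, *_ = librispeech_all[0]
```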
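The Experiment Setup row pins down most of the reported configuration, so a PyTorch sketch can make it concrete. The layer counts, unit sizes, warm-up steps, base learning rate of 1, and label smoothing below come from the quoted text; the CNN kernel sizes and strides, the flattening projection, the Adam betas/eps, and all class and function names (`AudioEncoder`, `noam_lr`) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """2-layer 2D CNN front-end feeding a transformer encoder with the
    reported sizes: 256 units, 12 layers, 4 heads, 2048 hidden units."""
    def __init__(self, n_mels=80, d_model=256, n_layers=12, n_heads=4, d_ff=2048):
        super().__init__()
        # Kernel size 3 / stride 2 are assumptions; each conv halves
        # both the time and frequency axes.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        freq_out = n_mels // 4  # frequency bins left after two stride-2 convs
        self.proj = nn.Linear(d_model * freq_out, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_ff)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, lfb):                  # lfb: (batch, time, 80) LFB features
        x = self.cnn(lfb.unsqueeze(1))       # -> (batch, d_model, time/4, 20)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.proj(x)                     # -> (batch, time/4, d_model)
        return self.encoder(x.transpose(0, 1))  # seq-first for the encoder

def noam_lr(step, d_model=256, warmup=4000):
    """Noam schedule: linear warm-up for `warmup` steps, then
    inverse-square-root decay, scaled by the base learning rate (1 here)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = AudioEncoder()
# betas/eps follow the usual Noam recipe; the paper does not state them.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_lr)  # step() per batch
# Cross entropy with label smoothing epsilon = 0.1, as reported.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

Beam search (beam size 4) runs in the decoder at inference time and is omitted from this encoder-side sketch.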