Untangling in Invariant Speech Recognition

Authors: Cory Stephenson, Jenelle Feather, Suchismita Padhy, Oguz Elibol, Hanlin Tang, Josh McDermott, SueYeon Chung

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To understand representation in speech models, we first train a neural network on a corpus of transcribed speech. Then, we use the trained models to extract per-layer representations at every time step for each stimulus in the corpus. Finally, we apply the mean-field theoretic manifold analysis technique [24, 25] (hereafter, the MFTMA technique) to measure manifold capacity and other manifold geometric properties (radius, dimension, correlation) on a subsample of the test dataset. We examined two speech recognition models. The first model is a CNN model based on [14]... The second is an end-to-end ASR model, Deep Speech 2 (DS2) [5]... (A hedged sketch of this extract-then-analyze pipeline is given after the table.)
Researcher Affiliation | Collaboration | Cory Stephenson (Intel AI Lab, cory.stephenson@intel.com); Jenelle Feather (MIT, jfeather@mit.edu); Suchismita Padhy (Intel AI Lab, suchismita.padhy@intel.com); Oguz Elibol (Intel AI Lab, oguz.h.elibol@intel.com); Hanlin Tang (Intel AI Lab, hanlin.tang@intel.com); Josh McDermott (MIT / Center for Brains, Minds, and Machines, jhm@mit.edu); SueYeon Chung (Columbia University / MIT, sueyeon@columbia.edu)
Pseudocode | No | The paper describes the methodology and models used but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation of the analysis methods: https://github.com/schung039/neural_manifolds_replicaMFT (A hedged usage sketch of this analysis code is given after the table.)
Open Datasets | Yes | We trained the model on two tasks: word recognition and speaker recognition. For word recognition, we trained on two-second segments from a combination of the WSJ Corpus [31] and Spoken Wikipedia Corpora [32], with noise augmentation from AudioSet backgrounds [33]. Our model was trained on the 960-hour training portion of the LibriSpeech dataset [35]... For the comparison between character, phoneme, word, and parts-of-speech manifolds, similar manifold datasets were also constructed from TIMIT, which includes phoneme and word alignments. (A hedged sketch of constructing such class manifolds from alignments is given after the table.)
Dataset Splits | No | The paper mentions training on the LibriSpeech dataset and testing on its partitions: 'Our model was trained on the 960 hour training portion of the LibriSpeech dataset [35], achieving a word error rate (WER) of 12%, and 22.7% respectively on the clean and other partitions of the test set without the use of a language model.' However, it does not explicitly specify a validation set or a detailed train/validation/test split for reproducibility. (A hedged data-loading sketch for these partitions is given after the table.)
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions the 'deepspeech.pytorch' implementation (footnote 4) and architectural components such as batch normalization layers, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | No | The paper states that 'For more training details, please see the SM' regarding the CNN model, and mentions the use of the CTC loss function for DS2. However, it does not explicitly provide concrete hyperparameters (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text. (A hedged CTC training sketch is given below.)
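
The Research Type row above summarizes an extract-then-analyze workflow: run stimuli through a trained network, record each layer's activations at every time step, and group them into per-class manifolds for geometric analysis. The sketch below illustrates that workflow under stated assumptions; the toy model, feature dimensions, and class labels are placeholders rather than the authors' setup.

    # Sketch: collect per-layer activations with forward hooks, then group them
    # into per-class "manifolds" for geometric analysis. The toy model and random
    # data below are placeholders for the trained ASR network and speech features.
    import torch
    import torch.nn as nn

    # Stand-in for a trained speech model (the paper uses a CNN and Deep Speech 2).
    model = nn.Sequential(
        nn.Linear(80, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 64),
    )
    model.eval()

    # Attach hooks so every linear layer's output is recorded during the forward pass.
    activations = {}
    def make_hook(name):
        def hook(module, inputs, output):
            activations[name] = output.detach()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            module.register_forward_hook(make_hook(name))

    # Toy "stimuli": 500 frames of 80-dim features with a class label each
    # (e.g., the word or phoneme active at that time step).
    features = torch.randn(500, 80)
    labels = torch.randint(0, 10, (500,))

    with torch.no_grad():
        model(features)

    # Group each layer's activations by class: one point cloud ("manifold") per class,
    # stored as (feature_dim, n_samples) arrays for downstream capacity analysis.
    manifolds = {
        name: [acts[labels == c].numpy().T for c in range(10)]
        for name, acts in activations.items()
    }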
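The analysis code referenced in the Open Source Code row is the natural next step on the grouped manifolds. The sketch below shows how such a call might look; the import path, the function name manifold_analysis_corr, the argument names, and the return order reflect my reading of the neural_manifolds_replicaMFT repository and should be treated as assumptions to verify against its README.

    # Sketch: feeding per-class manifolds into the released MFTMA analysis.
    # ASSUMPTION: the import path, function name, and return values below follow
    # my reading of the neural_manifolds_replicaMFT repository and may differ.
    import numpy as np
    from mftma.manifold_analysis_correlation import manifold_analysis_corr  # assumed API

    # One manifold per class: arrays of shape (feature_dim, n_samples_in_class),
    # e.g. produced by the hook-based extraction sketch above.
    rng = np.random.default_rng(0)
    manifolds = [rng.standard_normal((64, 50)) for _ in range(10)]

    # kappa is the classification margin (0 in the paper's setting) and n_t the number
    # of Gaussian samples in the mean-field estimate; both values here are guesses.
    capacity, radius, dimension, center_corr, K = manifold_analysis_corr(
        manifolds, kappa=0, n_t=200
    )

    # Capacities are conventionally averaged as 1 / <1/alpha> across manifolds.
    print("mean capacity:     ", 1.0 / np.mean(1.0 / capacity))
    print("mean radius:       ", np.mean(radius))
    print("mean dimension:    ", np.mean(dimension))
    print("center correlation:", center_corr)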
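The Open Datasets row mentions manifold datasets built from time-aligned corpora such as TIMIT. A minimal sketch of that construction follows, assuming an illustrative 10 ms frame rate and a simple (start, end, label) alignment format; the helper build_manifolds and the class/sample counts are hypothetical, not the paper's preprocessing.

    # Sketch: building per-word manifolds from time-aligned transcripts (e.g. TIMIT).
    # Frame rate, alignment format, and the class/sample counts are illustrative assumptions.
    import numpy as np
    from collections import defaultdict

    FRAME_RATE = 100  # frames per second (10 ms hop), an assumed feature frame rate

    def build_manifolds(features, alignments, n_classes=50, n_per_class=20, seed=0):
        """features: (n_frames, dim) array; alignments: list of (start_s, end_s, label)."""
        rng = np.random.default_rng(seed)
        frames_by_label = defaultdict(list)
        for start_s, end_s, label in alignments:
            lo, hi = int(start_s * FRAME_RATE), int(end_s * FRAME_RATE)
            frames_by_label[label].extend(range(lo, min(hi, len(features))))

        # Keep only labels with enough frames, then subsample a fixed number per class.
        eligible = [l for l, idx in frames_by_label.items() if len(idx) >= n_per_class]
        chosen = rng.permutation(eligible)[:n_classes]
        return {
            label: features[rng.choice(frames_by_label[label], n_per_class, replace=False)].T
            for label in chosen
        }

    # Toy example: 1000 frames of 40-dim features and a fabricated alignment.
    feats = np.random.randn(1000, 40)
    aligns = [(i * 0.2, (i + 1) * 0.2, f"word{i}") for i in range(50)]
    manifolds = build_manifolds(feats, aligns, n_classes=10, n_per_class=5)
    print({k: v.shape for k, v in manifolds.items()})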
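For the Dataset Splits row: the 960-hour LibriSpeech training portion is the union of the three standard training subsets, and the quoted WERs refer to the standard clean and other test partitions. One common way to obtain these splits is torchaudio, sketched below; the paper does not name a loading library, so this is purely illustrative.

    # Sketch: obtaining the standard LibriSpeech partitions with torchaudio.
    # The paper does not name a loading library; torchaudio here is an assumption.
    import torchaudio

    ROOT = "./data"  # placeholder download location

    # The 960-hour training portion is the union of the three standard train subsets.
    train_subsets = ["train-clean-100", "train-clean-360", "train-other-500"]
    train_sets = [
        torchaudio.datasets.LIBRISPEECH(ROOT, url=u, download=True) for u in train_subsets
    ]

    # "clean" and "other" test partitions, as reported in the WER quote above.
    test_clean = torchaudio.datasets.LIBRISPEECH(ROOT, url="test-clean", download=True)
    test_other = torchaudio.datasets.LIBRISPEECH(ROOT, url="test-other", download=True)

    # Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
    waveform, sr, transcript, *_ = test_clean[0]
    print(sr, transcript[:40])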
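Finally, for the Experiment Setup row: the paper notes that DS2 is trained with the CTC loss but defers training details to the supplementary material. The sketch below shows a generic CTC training step with torch.nn.CTCLoss; the model stand-in and every hyperparameter value are invented placeholders, not the authors' configuration.

    # Sketch: a generic CTC training step in the spirit of Deep Speech 2 training.
    # Model size, optimizer, learning rate, and batch size are illustrative guesses;
    # the paper defers its actual training configuration to the supplement.
    import torch
    import torch.nn as nn

    NUM_CHARS = 29          # e.g. 26 letters + space + apostrophe + CTC blank (index 0)
    model = nn.Sequential(  # stand-in for the DS2 convolutional + recurrent stack
        nn.Linear(161, 512), nn.ReLU(),
        nn.Linear(512, NUM_CHARS),
    )
    ctc_loss = nn.CTCLoss(blank=0)
    optimizer = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.9)

    # Toy batch: 8 utterances of 200 spectrogram frames, 161 frequency bins each.
    batch, frames = 8, 200
    inputs = torch.randn(batch, frames, 161)
    targets = torch.randint(1, NUM_CHARS, (batch, 30))           # character indices (no blanks)
    input_lengths = torch.full((batch,), frames, dtype=torch.long)
    target_lengths = torch.full((batch,), 30, dtype=torch.long)

    # CTCLoss expects log-probabilities shaped (time, batch, classes).
    log_probs = model(inputs).log_softmax(dim=-1).transpose(0, 1)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("CTC loss:", loss.item())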