Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Authors: Yonatan Belinkov, James Glass

NeurIPS 2017

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices. |
| Researcher Affiliation | Academia | Yonatan Belinkov and James Glass, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139. {belinkov, glass}@mit.edu |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | The code for all of our experiments is publicly available. (http://github.com/boknilev/asr-repr-analysis) |
| Open Datasets | Yes | The end-to-end models are trained on LibriSpeech [34], a publicly available corpus of English read speech, containing 1,000 hours sampled at 16 kHz. For the phoneme recognition task, we use TIMIT, which comes with time segmentation of phones. |
| Dataset Splits | Yes | We use the official train/development/test split and extract frames for the frame classification task. Table 2 summarizes statistics of the frame classification dataset. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions deepspeech.torch [33] but does not provide a specific version number. No other software with version numbers is listed. |
| Experiment Setup | Yes | We model the classifier as a feed-forward neural network with one hidden layer, where the size of the hidden layer is set to 500. We train the classifier with Adam [32] with the recommended parameters (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e−8) to minimize the cross-entropy loss. We use a batch size of 16, train the model for 30 epochs, and choose the model with the best development loss for evaluation. |
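The experiment setup in the last row can be sketched in code. Below is a minimal NumPy illustration of such a probing classifier: one 500-unit hidden layer trained with Adam (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e−8) on cross-entropy loss, batch size 16, 30 epochs. The feature dimension, class count, ReLU activation, weight initialization, and the toy data are assumptions for illustration, not details from the paper, which also selects the best model by development loss (omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HIDDEN, N_CLASSES = 64, 500, 10  # hidden size 500 per the paper; rest hypothetical

# Parameters of a one-hidden-layer feed-forward classifier
W1 = rng.normal(0, 0.1, (FEAT_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, N_CLASSES)); b2 = np.zeros(N_CLASSES)
params = [W1, b1, W2, b2]

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)          # ReLU hidden layer (activation is an assumption)
    return h, h @ W2 + b2                     # hidden activations, logits

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grads(x, y):
    h, logits = forward(x)
    p = softmax(logits)
    n = x.shape[0]
    loss = -np.log(p[np.arange(n), y] + 1e-12).mean()   # cross-entropy
    dlogits = p.copy()
    dlogits[np.arange(n), y] -= 1.0
    dlogits /= n
    gW2, gb2 = h.T @ dlogits, dlogits.sum(0)
    dh = dlogits @ W2.T
    dh[h <= 0] = 0.0                                     # ReLU gradient
    return loss, [x.T @ dh, dh.sum(0), gW2, gb2]

# Adam with the recommended parameters quoted in the paper
alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = [np.zeros_like(p) for p in params]
v = [np.zeros_like(p) for p in params]

# Toy stand-in for frame-level features and labels (placeholder data)
X = rng.normal(size=(256, FEAT_DIM))
y = np.where(X[:, 0] > 0, 5, 0)  # arbitrary, easily separable labels

t = 0
for epoch in range(30):                        # 30 epochs, batch size 16
    perm = rng.permutation(len(X))
    for i in range(0, len(X), 16):
        xb, yb = X[perm[i:i + 16]], y[perm[i:i + 16]]
        loss, grads = loss_and_grads(xb, yb)
        t += 1
        for j, (p, g) in enumerate(zip(params, grads)):
            m[j] = beta1 * m[j] + (1 - beta1) * g
            v[j] = beta2 * v[j] + (1 - beta2) * g * g
            mhat = m[j] / (1 - beta1 ** t)     # bias-corrected first moment
            vhat = v[j] / (1 - beta2 ** t)     # bias-corrected second moment
            p -= alpha * mhat / (np.sqrt(vhat) + eps)

acc = (forward(X)[1].argmax(axis=1) == y).mean()
```

In the paper the inputs `X` would be hidden representations extracted from a trained end-to-end ASR model, and `y` would be frame-level phone labels; the classifier's accuracy is then read as a measure of how much phonetic information the representation encodes.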