Attention-Based Models for Speech Recognition

Authors: Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio

NeurIPS 2015

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "We show that while an adaptation of the model used for machine translation in [2] reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER in single utterances and 20% in 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to 17.6% level."

Researcher Affiliation | Academia | Jan Chorowski (University of Wrocław, Poland, jan.chorowski@ii.uni.wroc.pl); Dzmitry Bahdanau (Jacobs University Bremen, Germany); Dmitriy Serdyuk (Université de Montréal); Kyunghyun Cho (Université de Montréal); Yoshua Bengio (Université de Montréal, CIFAR Senior Fellow)

Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.

Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.

Open Datasets | Yes | "We used the train-dev-test split from the Kaldi [20] TIMIT s5 recipe. We trained on the standard 462 speaker set with all SA utterances removed and used the 50 speaker dev set for early stopping. We tested on the 24 speaker core test set. All networks were trained on 40 mel-scale filterbank features together with the energy in each frame, and first and second temporal differences, yielding in total 123 features per frame."

Dataset Splits | Yes | "We used the train-dev-test split from the Kaldi [20] TIMIT s5 recipe. We trained on the standard 462 speaker set with all SA utterances removed and used the 50 speaker dev set for early stopping. We tested on the 24 speaker core test set."

Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used to run its experiments.

Software Dependencies | Yes | "All experiments were conducted using Theano [27, 28], PyLearn2 [29], and Blocks [30] libraries."

Experiment Setup | Yes | "We used an adaptive learning rate algorithm, AdaDelta [21], which has two hyperparameters ϵ and ρ. All the weight matrices were initialized from a normal Gaussian distribution with its standard deviation set to 0.01. Recurrent weights were orthogonalized. ... During this time, ϵ and ρ are set to 10^-8 and 0.95, respectively. ... Once the new lowest development log-likelihood was reached, we fine-tuned the model with a smaller ϵ = 10^-10, until we did not observe the improvement in the development phoneme error rate (PER) for 100K weight updates. Batch size 1 was used throughout the training."
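The quoted experiment setup can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the layer sizes are invented for the example, QR is assumed as the orthogonalization method (the paper does not say which was used), and the AdaDelta update follows Zeiler's standard formulation with the paper's ρ = 0.95 and ϵ = 10^-8.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weight(shape, std=0.01):
    """Non-recurrent weights: Gaussian N(0, 0.01), as quoted above."""
    return rng.normal(0.0, std, size=shape)

def init_recurrent(dim):
    """Recurrent weights: Gaussian init, then orthogonalized (QR assumed here)."""
    q, _ = np.linalg.qr(rng.normal(0.0, 0.01, size=(dim, dim)))
    return q

def adadelta_step(param, grad, eg2, edx2, rho=0.95, eps=1e-8):
    """One AdaDelta update (Zeiler, 2012) with the paper's rho and eps."""
    eg2 = rho * eg2 + (1 - rho) * grad**2                      # running avg of grad^2
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad      # scaled step
    edx2 = rho * edx2 + (1 - rho) * dx**2                      # running avg of update^2
    return param + dx, eg2, edx2

# Illustrative shapes: 123 input features per frame (see the dataset row)
# mapped to a hypothetical 256-unit recurrent layer.
W = init_weight((123, 256))
U = init_recurrent(256)
```

Fine-tuning as described would then amount to rerunning the same loop with `eps=1e-10` once the development log-likelihood stops improving.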