Latent Sequence Decompositions

Authors: William Chan, Yu Zhang, Quoc Le, Navdeep Jaitly

Venue: ICLR 2017

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We experiment with the Wall Street Journal speech recognition task. Our LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER. |
| Researcher Affiliation | Collaboration | William Chan (Carnegie Mellon University, williamchan@cmu.edu); Yu Zhang (Massachusetts Institute of Technology, yzhang87@mit.edu); Quoc V. Le and Navdeep Jaitly (Google Brain, {qvl,ndjaitly}@google.com) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., a repository link, an explicit statement of code release, or code in supplementary materials) for the source code of the described methodology. |
| Open Datasets | Yes | We experimented with the Wall Street Journal (WSJ) ASR task. We used the standard configuration of train si284 dataset for training, dev93 for validation and eval92 for test evaluation. |
| Dataset Splits | Yes | We used the standard configuration of train si284 dataset for training, dev93 for validation and eval92 for test evaluation. |
| Hardware Specification | No | We used 8 GPU workers for asynchronous SGD under the TensorFlow framework (Abadi et al., 2015). (No specific GPU models or other hardware details are provided.) |
| Software Dependencies | No | The paper mentions using the TensorFlow framework and features generated by Kaldi, but does not provide version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Our input features were 80 dimensional filterbanks computed every 10ms with delta and delta-delta acceleration normalized with per speaker mean and variance as generated by Kaldi (Povey et al., 2011). The EncodeRNN function is a 3 layer BLSTM with 256 LSTM units per-direction (or 512 total) and a 4 = 2² time factor reduction. The DecodeRNN is a 1 layer LSTM with 256 LSTM units. All the weight matrices were initialized with a uniform distribution U(-0.075, 0.075) and bias vectors to 0. Gradient norm clipping of 1 was used, Gaussian weight noise N(0, 0.075) and L2 weight decay 1e-5 (Graves, 2011). We used ADAM with the default hyperparameters described in (Kingma & Ba, 2015), however we decayed the learning rate from 1e-3 to 1e-4. (A hedged configuration sketch based on this description follows the table.) |
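
The Experiment Setup row gives enough detail to reconstruct the rough shape of the model and its optimizer. The following is a minimal PyTorch sketch of that configuration, not the authors' implementation (which used TensorFlow and was not released): the class names, the frame-concatenation used to realise the 4x (= 2²) time reduction, the illustrative vocabulary size, and the omission of the attention mechanism are all assumptions made for this sketch.

```python
# Hedged sketch of the configuration quoted in the Experiment Setup row.
# Assumptions (not from the paper): module names, pyramidal frame concatenation
# for the 4x time reduction, and a decoder without the attention mechanism.
import torch
import torch.nn as nn


class ListenerEncoder(nn.Module):
    """3-layer BLSTM, 256 units per direction (512 total per frame)."""

    def __init__(self, input_dim: int = 240, hidden: int = 256):
        # 240 = 80 filterbanks x (static + delta + delta-delta).
        super().__init__()
        self.blstm1 = nn.LSTM(input_dim, hidden, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.blstm3 = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)

    @staticmethod
    def _halve_time(x: torch.Tensor) -> torch.Tensor:
        # Concatenate consecutive frame pairs: (B, T, D) -> (B, T//2, 2D).
        b, t, d = x.shape
        t -= t % 2
        return x[:, :t].reshape(b, t // 2, 2 * d)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm1(feats)
        h, _ = self.blstm2(self._halve_time(h))  # time / 2
        h, _ = self.blstm3(self._halve_time(h))  # time / 4 = 2^2 overall
        return h  # (B, T // 4, 512)


class SpellerDecoder(nn.Module):
    """1-layer LSTM decoder with 256 units; attention over the encoder
    output is part of the paper's model but omitted from this sketch."""

    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(tokens))
        return self.proj(h)  # logits over the output token vocabulary


def init_params(model: nn.Module, scale: float = 0.075) -> None:
    # Weights ~ U(-0.075, 0.075), biases set to 0, as stated above.
    for name, p in model.named_parameters():
        if "bias" in name:
            nn.init.zeros_(p)
        else:
            nn.init.uniform_(p, -scale, scale)


encoder, decoder = ListenerEncoder(), SpellerDecoder(vocab_size=64)  # vocab size is illustrative
init_params(encoder)
init_params(decoder)

params = list(encoder.parameters()) + list(decoder.parameters())
# ADAM with default betas; the learning rate is decayed from 1e-3 to 1e-4
# during training, and weight_decay approximates the stated L2 penalty of 1e-5.
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-5)

# Per the setup description, a training step would also clip the global
# gradient norm to 1 and add Gaussian weight noise N(0, 0.075) (Graves, 2011):
# torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
```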
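
For reference, the figures quoted in the Research Type row (12.9% vs. 14.8%) are word error rates: word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. Below is a minimal, self-contained sketch of that metric; the paper does not state which scoring tool was used, so this is only a definition-level illustration.

```python
# Word error rate: word-level Levenshtein distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("the cat sat", "the cat sit"))  # 0.333... (1 substitution over 3 words)
```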