Optimal Completion Distillation for Sequence Learning
Authors: Sara Sabour, William Chan, Mohammad Norouzi
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving 9.3% and 4.5% word error rates, respectively. |
| Researcher Affiliation | Industry | Sara Sabour, William Chan, Mohammad Norouzi {sasabour, williamchan, mnorouzi}@google.com Google Brain |
| Pseudocode | Yes | Procedure 1 (EditDistanceQ) returns the Q-values of the tokens at each time step, based on the minimum edit distance between a reference sequence r and a hypothesis sequence h of length t. (A minimal Python sketch of this procedure follows the table.) |
| Open Source Code | No | We are in the process of releasing the code for OCD. |
| Open Datasets | Yes | We conduct our experiments on speech recognition on the Wall Street Journal (WSJ) (Paul and Baker, 1992) and Librispeech (Panayotov et al., 2015) benchmarks. |
| Dataset Splits | Yes | We use the standard configuration of si284 for training, dev93 for validation and report both test Character Error Rate (CER) and Word Error Rate (WER) on eval92. [...] For the Librispeech dataset, we train on the full training set (960h audio data) and validate our results on the dev-other set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'TensorFlow (Abadi et al., 2016)' but does not specify a version number for it, nor for any other libraries or dependencies. |
| Experiment Setup | Yes | Our encoder uses 2 layers of convolutions with 3x3 filters, stride 2x2 and 32 channels, followed by a convolutional LSTM with 1D-convolution of filter width 3, followed by 3 LSTM layers with 256 cell size. [...] train our models for 300 epochs with batch size 8 and 8 async workers. We separately tune the learning rate for our baseline and OCD models: 0.0007 for OCD vs. 0.001 for the baseline. (A hedged Keras sketch of this encoder follows the table.) |
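To make the quoted procedure concrete, below is a minimal Python sketch of the edit-distance Q-value computation that Procedure 1 describes. The function name, the numpy usage, and the `vocab` argument are our own; end-of-sequence handling is noted in a comment but omitted for brevity. The Q-value of token `a` after a given hypothesis prefix is the negative of the best total edit distance still achievable once `a` is appended.

```python
import numpy as np

def edit_distance_q_values(ref, hyp, vocab):
    """Q-values of every token in `vocab` for each prefix of `hyp`.

    Q(prefix, a) = -(minimum edit distance to `ref` achievable by any
    completion of the prefix that starts with token a).
    """
    t, n = len(hyp), len(ref)
    # Standard Levenshtein table: d[i, j] is the edit distance between
    # the first i tokens of hyp and the first j tokens of ref.
    d = np.zeros((t + 1, n + 1), dtype=np.int64)
    d[:, 0] = np.arange(t + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, t + 1):
        for j in range(1, n + 1):
            d[i, j] = min(
                d[i - 1, j] + 1,                               # insertion
                d[i, j - 1] + 1,                               # deletion
                d[i - 1, j - 1] + (hyp[i - 1] != ref[j - 1]),  # substitution
            )
    q = np.empty((t + 1, len(vocab)), dtype=np.int64)
    for i in range(t + 1):
        m = int(d[i].min())
        # Extending with ref[j] from any column j that attains the row
        # minimum keeps the optimal completion cost at m; any other token
        # costs exactly one extra edit. (An end-of-sequence token would be
        # optimal when d[i, n] == m; omitted here.)
        optimal = {ref[j] for j in range(n) if d[i, j] == m}
        for k, a in enumerate(vocab):
            q[i, k] = -m if a in optimal else -(m + 1)
    return q

# Example pairing used in the paper's running illustration:
q = edit_distance_q_values(list("SUNDAY"), list("SATURDAY"),
                           sorted(set("SUNDAY") | set("SATURDAY")))
```

As the temperature of the exponentiated Q-values goes to zero, the OCD training target at each step reduces to a uniform distribution over the argmax tokens of the corresponding row of `q`.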
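Similarly, a hedged tf.keras sketch of the quoted encoder stack is given below. Only the layer types, filter sizes, strides, channel counts, and LSTM cell sizes come from the quote; the input feature shape, the ConvLSTM channel count, the padding, and the activations are assumptions (and `ConvLSTM1D` requires a recent TensorFlow release).

```python
import tensorflow as tf

def build_encoder(freq_bins=80):
    """Sketch of the encoder from the quoted setup; see caveats above."""
    # Spectrogram-like input: (time, frequency, 1) with a dynamic time axis.
    inp = tf.keras.Input(shape=(None, freq_bins, 1))
    x = inp
    # 2 convolution layers: 3x3 filters, 2x2 stride, 32 channels.
    for _ in range(2):
        x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same",
                                   activation="relu")(x)
    # Convolutional LSTM over time with a width-3 1D convolution along the
    # frequency axis (the channel count of 32 is an assumption).
    x = tf.keras.layers.ConvLSTM1D(32, 3, padding="same",
                                   return_sequences=True)(x)
    # Collapse (frequency, channels) per frame, then 3 LSTM layers with
    # 256 cells each.
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)
    for _ in range(3):
        x = tf.keras.layers.LSTM(256, return_sequences=True)(x)
    return tf.keras.Model(inp, x)
```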