A fully differentiable beam search decoder
Authors: Ronan Collobert, Awni Hannun, Gabriel Synnaeve
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply DBD to the task of automatic speech recognition and show competitive performance on the Wall Street Journal (WSJ) corpus (Paul & Baker, 1992). ... We performed experiments with WSJ (about 81h of transcribed audio data). |
| Researcher Affiliation | Industry | 1Facebook AI Research. Correspondence to: Ronan Collobert <locronan@fb.com>, Awni Hannun <awni@fb.com>, Gabriel Synnaeve <gab@fb.com>. |
| Pseudocode | No | The paper describes algorithms verbally and mathematically, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code or a link to a code repository. |
| Open Datasets | Yes | We apply DBD to the task of automatic speech recognition and show competitive performance on the Wall Street Journal (WSJ) corpus (Paul & Baker, 1992). |
| Dataset Splits | Yes | We consider the standard subsets si284, nov93dev and nov92 for training, validation and test, respectively. |
| Hardware Specification | No | Both the neural network acoustic model and the ASG criterion run on a single GPU. The DBD criterion is CPU-only. No specific GPU or CPU models are mentioned. |
| Software Dependencies | No | The paper mentions KenLM but does not specify a version number for it or any other key software dependencies. |
| Experiment Setup | Yes | We use log-mel filterbanks as features fed to the acoustic model, with 40 filters of size 25ms, strided by 10ms. ... We consider an end-to-end setup, where the token set D (see Section 2) includes English letters (a-z), the apostrophe and the period character, as well as a space character, leading to 29 different tokens. ... All the models are trained with stochastic gradient descent (SGD), enhanced with gradient clipping (Pascanu et al., 2013) and weight normalization (Salimans & Kingma, 2016). We use batch training (16 utterances at once), sorting inputs by length for efficiency. ... Our best Conv Net acoustic model has 10M parameters and an overall receptive field of 1350ms. ... In most experiments, we use a beam size of 500, as larger beam sizes led to marginal WER improvements. |
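
The setup quoted in the last row is concrete enough to sketch in code. Below is a minimal, hypothetical reconstruction in PyTorch/torchaudio, not the authors' implementation: the 29-token set, the 40-filter log-mel front end (25ms windows, 10ms stride, assuming 16 kHz WSJ audio), and an SGD step with gradient clipping and weight normalization on batches of 16. The acoustic model, learning rate, and clipping threshold are placeholders, and the paper's ASG/DBD criteria and 10M-parameter ConvNet are not reproduced.

```python
import string

import torch
import torchaudio

# 29 tokens: English letters a-z, apostrophe, period, and space.
TOKENS = list(string.ascii_lowercase) + ["'", ".", " "]
assert len(TOKENS) == 29

# 40 log-mel filterbanks over 25 ms windows strided by 10 ms
# (400 / 160 samples, assuming 16 kHz WSJ audio).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400, hop_length=160, n_mels=40
)

def features(waveform: torch.Tensor) -> torch.Tensor:
    """Log-mel features for a (1, num_samples) waveform -> (num_frames, 40)."""
    return torch.log(melspec(waveform) + 1e-6).squeeze(0).transpose(0, 1)

# Placeholder acoustic model with weight normalization; the paper's model is a
# ~10M-parameter ConvNet with a 1350 ms receptive field, not reproduced here.
model = torch.nn.Sequential(
    torch.nn.utils.weight_norm(torch.nn.Conv1d(40, 256, kernel_size=13, padding=6)),
    torch.nn.ReLU(),
    torch.nn.utils.weight_norm(torch.nn.Conv1d(256, len(TOKENS), kernel_size=1)),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate is an assumption

def train_step(batch_feats: torch.Tensor, loss_fn) -> float:
    """One SGD step with gradient clipping; batch_feats is (16, 40, num_frames)."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch_feats))  # stand-in for the ASG/DBD criterion
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip value assumed
    optimizer.step()
    return float(loss.item())

BEAM_SIZE = 500  # beam size used for decoding in most experiments
```

The only values taken from the paper are the feature parameters, the token set, the batch size implied by the docstring, and the beam size; everything else is an illustrative stand-in.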