Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling
Authors: Hairong Liu, Zhenyao Zhu, Xiangang Li, Sanjeev Satheesh
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that the proposed Gram-CTC improves CTC in terms of both performance and efficiency on the large vocabulary speech recognition task at multiple scales of data, and that with Gram-CTC we can outperform the state-of-the-art on a standard speech benchmark. |
| Researcher Affiliation | Industry | 1Baidu Silicon Valley AI Lab, 1195 Bordeaux Dr, Sunnyvale, CA 94089, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Wall Street Journal (WSJ). This corpus consists primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text, and contains about 80 hours of speech data. We used the standard configuration of the train si284 dataset for training, dev93 for validation, and eval92 for testing. Fisher-Switchboard. This is a commonly used English conversational telephone speech (CTS) corpus, which contains 2,300 hours of CTS data. |
| Dataset Splits | Yes | We used the standard configuration of the train si284 dataset for training, dev93 for validation, and eval92 for testing (see the configuration sketch after this table). |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers needed to replicate the experiment. |
| Experiment Setup | Yes | The network inputs are spectral magnitude maps spanning 0–8 kHz, with 161 features per 10 ms frame. At each epoch, 40% of the utterances are randomly selected to have background noise added. The optimization method is stochastic gradient descent with Nesterov momentum; typical values are a learning rate of 10^-3 and a momentum of 0.99 (see the sketches after this table). |
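
The quoted featurization pins down the frame geometry: 161 spectral magnitudes per 10 ms frame covering 0–8 kHz implies a 16 kHz sampling rate and a 320-point FFT, since 320 // 2 + 1 = 161 bins. Below is a minimal Python sketch of such a featurization; the 20 ms window length and the Hann window are assumptions not stated in the quoted excerpt, and `magnitude_spectrogram` is a hypothetical helper name, not the paper's pipeline.

```python
import numpy as np

def magnitude_spectrogram(waveform, sample_rate=16000,
                          window_ms=20, hop_ms=10):
    """Magnitude spectrogram: 161 bins per 10 ms frame, 0-8 kHz.

    A 20 ms window at 16 kHz is 320 samples, and a 320-point real FFT
    yields 320 // 2 + 1 = 161 frequency bins spanning 0 to 8 kHz.
    """
    win = int(sample_rate * window_ms / 1000)   # 320 samples
    hop = int(sample_rate * hop_ms / 1000)      # 160 samples
    assert len(waveform) >= win, "need at least one full window"
    window = np.hanning(win)
    n_frames = 1 + (len(waveform) - win) // hop
    frames = np.stack([waveform[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape (n_frames, 161)

# Example: 1 second of audio -> 99 frames x 161 features.
features = magnitude_spectrogram(np.random.randn(16000))
print(features.shape)  # (99, 161)
```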
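
The quoted dataset splits and optimizer settings likewise translate directly into configuration. The PyTorch sketch below wires up SGD with Nesterov momentum at the reported values and the per-epoch 40% noise-augmentation selection. The split names come from the paper, but `select_for_noise`, the seed handling, and the stand-in model are hypothetical, since the paper does not describe its training code.

```python
import random
import torch

# Standard WSJ configuration quoted in the table above.
WSJ_SPLITS = {"train": "train_si284", "valid": "dev93", "test": "eval92"}

def select_for_noise(utterance_ids, fraction=0.4, seed=None):
    """Randomly pick the fraction of utterances (40% per the paper)
    that will have background noise added this epoch."""
    rng = random.Random(seed)
    k = int(len(utterance_ids) * fraction)
    return set(rng.sample(utterance_ids, k))

# Stand-in model: the acoustic model architecture is not part of the
# quoted setup, so a single linear layer stands in here purely to make
# the optimizer call concrete (161 spectral input features).
model = torch.nn.Linear(161, 29)

# SGD with Nesterov momentum at the "typical values" quoted above.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.99, nesterov=True)

# Per-epoch usage: re-draw the 40% noise subset each epoch.
utterances = [f"utt_{i:04d}" for i in range(1000)]
noisy_this_epoch = select_for_noise(utterances, seed=0)
print(len(noisy_this_epoch))  # 400
```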