Revisiting the Entropy Semiring for Neural Speech Recognition
Authors: Oscar Chang, Dongseong Hwang, Olivier Siohan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario. |
| Researcher Affiliation | Industry | Oscar Chang, Dongseong Hwang, Olivier Siohan Google {oscarchang,dongseong,siohan}@google.com |
| Pseudocode | No | The paper defines mathematical concepts and provides examples of semirings, but it does not include any sections explicitly labeled 'Pseudocode' or 'Algorithm' with structured code blocks. |
| Open Source Code | Yes | One of the main contributions of our work is to make an open-source implementation of CTC and RNN-T in the semiring framework available to the research community (cf. Supplementary Material). |
| Open Datasets | Yes | We experimented with models using non-causal LSTM and Conformer (Gulati et al., 2020) encoders on the Librispeech dataset (Panayotov et al., 2015). |
| Dataset Splits | Yes | We report 95% confidence interval estimates for the WER following Vilar (2008). |
| Hardware Specification | No | The paper mentions training various models (LSTM, Conformer) with specific parameters and training steps, but it does not provide any specific details about the hardware (e.g., GPU model, CPU type, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions 'Tensorflow' as an implementation environment in Appendix A, but it does not specify version numbers for TensorFlow or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | All models are trained with Adam using the optimization schedule specified in Vaswani et al. (2017), with a 10k warmup, batch size 2048, and a peak learning rate of 0.002. The LSTM encoders have 4 bi-directional layers with cell size 512 and are trained for 100k steps, while the Conformer encoders have 16 full-context attention layers with model dimension 144 and are trained for 400k steps. Decoding for all models is done with beam search, with the CTC decoders using a beam width of 16, and the RNN-T decoders using a beam width of 8 and a 1-layer LSTM with cell size 320. All models use a Word Piece tokenizer with a vocabulary size of 1024. α_Ent was selected via a grid search on {0.01, 0.001}. |
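
The 'Pseudocode' row above notes that the paper presents semirings as mathematical definitions rather than labeled algorithm blocks. For readers who want a concrete reference point, here is a minimal sketch of one common formulation of the entropy (expectation) semiring, assuming the standard pairing of probability mass with an entropy-weighted accumulator; the class and function names are hypothetical and are not taken from the paper's open-source implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EntropySemiringElement:
    """Pair (p, r): p carries probability mass, r carries the entropy-weighted term."""
    p: float
    r: float

    def __add__(self, other: "EntropySemiringElement") -> "EntropySemiringElement":
        # Semiring "plus": component-wise addition.
        return EntropySemiringElement(self.p + other.p, self.r + other.r)

    def __mul__(self, other: "EntropySemiringElement") -> "EntropySemiringElement":
        # Semiring "times": product on p, product rule on r.
        return EntropySemiringElement(
            self.p * other.p,
            self.p * other.r + other.p * self.r,
        )


ZERO = EntropySemiringElement(0.0, 0.0)  # additive identity
ONE = EntropySemiringElement(1.0, 0.0)   # multiplicative identity


def lift(prob: float) -> EntropySemiringElement:
    """Embed an arc probability p as the pair (p, -p * log p)."""
    return EntropySemiringElement(prob, -prob * math.log(prob))


def entropy(total: EntropySemiringElement) -> float:
    """Shannon entropy of the path distribution, given the aggregated pair (Z, r)."""
    return total.r / total.p + math.log(total.p)
```

Under these definitions, multiplying lifted arc weights along a path yields (P, -P log P) for that path, and summing over paths yields the partition function together with the unnormalized entropy, which is the property a semiring-based forward pass exploits on CTC and RNN-T lattices.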
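
Similarly, the 'Experiment Setup' row cites the optimization schedule of Vaswani et al. (2017) with a 10k-step warmup and a peak learning rate of 0.002. Below is a minimal sketch of that schedule, rescaled so its maximum equals the stated peak at the end of warmup; the exact scaling convention used by the authors is not stated, so treat the `peak_lr` rescaling as an assumption.

```python
def transformer_lr(step: int, peak_lr: float = 0.002, warmup_steps: int = 10_000) -> float:
    """Inverse-square-root schedule: linear warmup to peak_lr, then step**-0.5 decay."""
    step = max(step, 1)
    scale = peak_lr * warmup_steps ** 0.5
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5)


# The schedule reaches its maximum at the end of warmup.
assert abs(transformer_lr(10_000) - 0.002) < 1e-12
```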