Revisiting the Entropy Semiring for Neural Speech Recognition

Authors: Oscar Chang, Dongseong Hwang, Olivier Siohan

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.
Researcher Affiliation | Industry | Oscar Chang, Dongseong Hwang, Olivier Siohan, Google, {oscarchang,dongseong,siohan}@google.com
Pseudocode | No | The paper defines mathematical concepts and provides examples of semirings, but it does not include any sections explicitly labeled 'Pseudocode' or 'Algorithm' with structured code blocks. (A minimal semiring sketch follows the table.)
Open Source Code | Yes | One of the main contributions of our work is to make an open-source implementation of CTC and RNN-T in the semiring framework available to the research community (cf. Supplementary Material).
Open Datasets | Yes | We experimented with models using non-causal LSTM and Conformer (Gulati et al., 2020) encoders on the Librispeech dataset (Panayotov et al., 2015).
Dataset Splits | Yes | We report 95% confidence interval estimates for the WER following Vilar (2008).
Hardware Specification | No | The paper mentions training various models (LSTM, Conformer) with specific parameters and training steps, but it does not provide any details about the hardware (e.g., GPU model, CPU type, memory) used for these experiments.
Software Dependencies | No | The paper mentions TensorFlow as an implementation environment in Appendix A, but it does not specify version numbers for TensorFlow or any other software dependencies, which are needed for reproducibility.
Experiment Setup | Yes | All models are trained with Adam using the optimization schedule specified in Vaswani et al. (2017), with 10k warmup steps, a batch size of 2048, and a peak learning rate of 0.002. The LSTM encoders have 4 bi-directional layers with cell size 512 and are trained for 100k steps, while the Conformer encoders have 16 full-context attention layers with model dimension 144 and are trained for 400k steps. Decoding for all models is done with beam search; the CTC decoders use a beam width of 16, and the RNN-T decoders use a beam width of 8 and a 1-layer LSTM with cell size 320. All models use a Word Piece tokenizer with a vocabulary size of 1024. αEnt was selected via a grid search over {0.01, 0.001}. (A sketch of this schedule follows the table.)
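Although the paper provides no pseudocode, the semiring framework it builds on can be illustrated compactly. The following is a minimal sketch, not the authors' released implementation: it assumes the standard entropy (expectation) semiring over pairs (p, -p log p) and checks, on a toy two-path lattice, that summing path weights yields both the total probability and the entropy term. All function names are illustrative.

```python
import math

# Entropy (expectation) semiring over pairs (p, r), where r tracks -p*log(p).
#   plus : (p1, r1) (+) (p2, r2) = (p1 + p2, r1 + r2)
#   times: (p1, r1) (x) (p2, r2) = (p1*p2, p1*r2 + p2*r1)
# Illustrative sketch only; not the paper's open-source CTC/RNN-T code.

def lift(p):
    """Lift an edge probability into the entropy semiring."""
    return (p, -p * math.log(p))

def splus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

ZERO = (0.0, 0.0)  # additive identity
ONE = (1.0, 0.0)   # multiplicative identity

def path_weight(edge_probs):
    """Semiring product of lifted edge weights along one path."""
    w = ONE
    for p in edge_probs:
        w = stimes(w, lift(p))
    return w

# Toy lattice with two paths: probabilities 0.6*0.5 = 0.3 and 0.4*0.5 = 0.2.
paths = [[0.6, 0.5], [0.4, 0.5]]
total = ZERO
for edges in paths:
    total = splus(total, path_weight(edges))

path_probs = [math.prod(edges) for edges in paths]
expected_entropy_term = -sum(p * math.log(p) for p in path_probs)
assert abs(total[0] - sum(path_probs)) < 1e-12
assert abs(total[1] - expected_entropy_term) < 1e-12
print(total)  # first component: total probability; second: -sum p*log(p) over paths
```

Replacing lift(p) with (p, 0.0) recovers the ordinary probability semiring, which is the sense in which the same forward recursion can be reused for different quantities; a practical implementation would work in log space for numerical stability, which this toy sketch omits.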
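For the experiment setup row, the schedule of Vaswani et al. (2017) is the inverse-square-root decay with linear warmup. The sketch below uses the 10k warmup stated above; rescaling so the peak value equals 0.002 at the end of warmup is an assumption about how the stated peak learning rate is applied, since the original formula ties the peak to the model dimension instead.

```python
def transformer_lr(step, warmup_steps=10_000, peak_lr=0.002):
    """Inverse-square-root schedule with linear warmup (Vaswani et al., 2017).

    Original form: d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5).
    Here it is rescaled so the maximum (reached at step == warmup_steps) equals
    peak_lr; this normalization is an assumption, not taken from the paper.
    """
    step = max(step, 1)
    scale = peak_lr * warmup_steps ** 0.5  # makes lr == peak_lr at step == warmup_steps
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: linear ramp to the peak at step 10k, then 1/sqrt(step) decay.
for s in (1, 5_000, 10_000, 40_000, 100_000):
    print(s, transformer_lr(s))
```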