Revisiting the Entropy Semiring for Neural Speech Recognition

Authors: Oscar Chang, Dongseong Hwang, Olivier Siohan

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.
Researcher Affiliation | Industry | Oscar Chang, Dongseong Hwang, Olivier Siohan, Google, {oscarchang,dongseong,siohan}@google.com
Pseudocode | No | The paper defines mathematical concepts and provides examples of semirings, but it does not include any sections explicitly labeled 'Pseudocode' or 'Algorithm' with structured code blocks. (A minimal semiring sketch follows the table.)
Open Source Code | Yes | One of the main contributions of our work is to make an open-source implementation of CTC and RNN-T in the semiring framework available to the research community (cf. Supplementary Material).
Open Datasets | Yes | We experimented with models using non-causal LSTM and Conformer (Gulati et al., 2020) encoders on the Librispeech dataset (Panayotov et al., 2015).
Dataset Splits | Yes | We report 95% confidence interval estimates for the WER following Vilar (2008).
Hardware Specification | No | The paper mentions training various models (LSTM, Conformer) with specific parameters and training steps, but it does not provide any details about the hardware (e.g., GPU model, CPU type, memory) used for these experiments.
Software Dependencies | No | The paper mentions TensorFlow as an implementation environment in Appendix A, but it does not specify version numbers for TensorFlow or any other software dependencies, which are needed for reproducibility.
Experiment Setup | Yes | All models are trained with Adam using the optimization schedule specified in Vaswani et al. (2017), with 10k warmup steps, a batch size of 2048, and a peak learning rate of 0.002. The LSTM encoders have 4 bi-directional layers with cell size 512 and are trained for 100k steps, while the Conformer encoders have 16 full-context attention layers with model dimension 144 and are trained for 400k steps. Decoding for all models is done with beam search; the CTC decoders use a beam width of 16, and the RNN-T decoders use a beam width of 8 and a 1-layer LSTM with cell size 320. All models use a Word Piece tokenizer with a vocabulary size of 1024. αEnt was selected via a grid search over {0.01, 0.001}. (A sketch of this schedule follows the table.)
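Although the paper provides no pseudocode, the semiring framework it builds on can be illustrated compactly. The following is a minimal sketch, not the authors' released implementation: it assumes the standard entropy (expectation) semiring over pairs (p, -p log p) and checks, on a toy two-path lattice, that summing path weights yields both the total probability and the entropy term. All function names are illustrative.

```python
import math

# Entropy (expectation) semiring over pairs (p, r), where r tracks -p*log(p).
#   plus : (p1, r1) (+) (p2, r2) = (p1 + p2, r1 + r2)
#   times: (p1, r1) (x) (p2, r2) = (p1*p2, p1*r2 + p2*r1)
# Illustrative sketch only; not the paper's open-source CTC/RNN-T code.

def lift(p):
    """Lift an edge probability into the entropy semiring."""
    return (p, -p * math.log(p))

def splus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

ZERO = (0.0, 0.0)  # additive identity
ONE = (1.0, 0.0)   # multiplicative identity

def path_weight(edge_probs):
    """Semiring product of lifted edge weights along one path."""
    w = ONE
    for p in edge_probs:
        w = stimes(w, lift(p))
    return w

# Toy lattice with two paths: probabilities 0.6*0.5 = 0.3 and 0.4*0.5 = 0.2.
paths = [[0.6, 0.5], [0.4, 0.5]]
total = ZERO
for edges in paths:
    total = splus(total, path_weight(edges))

path_probs = [math.prod(edges) for edges in paths]
expected_entropy_term = -sum(p * math.log(p) for p in path_probs)
assert abs(total[0] - sum(path_probs)) < 1e-12
assert abs(total[1] - expected_entropy_term) < 1e-12
print(total)  # first component: total probability; second: -sum p*log(p) over paths
```

Replacing lift(p) with (p, 0.0) recovers the ordinary probability semiring, which is the sense in which the same forward recursion can be reused for different quantities; a practical implementation would work in log space for numerical stability, which this toy sketch omits.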
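For the experiment setup row, the schedule of Vaswani et al. (2017) is the inverse-square-root decay with linear warmup. The sketch below uses the 10k warmup stated above; rescaling so the peak value equals 0.002 at the end of warmup is an assumption about how the stated peak learning rate is applied, since the original formula ties the peak to the model dimension instead.

```python
def transformer_lr(step, warmup_steps=10_000, peak_lr=0.002):
    """Inverse-square-root schedule with linear warmup (Vaswani et al., 2017).

    Original form: d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5).
    Here it is rescaled so the maximum (reached at step == warmup_steps) equals
    peak_lr; this normalization is an assumption, not taken from the paper.
    """
    step = max(step, 1)
    scale = peak_lr * warmup_steps ** 0.5  # makes lr == peak_lr at step == warmup_steps
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: linear ramp to the peak at step 10k, then 1/sqrt(step) decay.
for s in (1, 5_000, 10_000, 40_000, 100_000):
    print(s, transformer_lr(s))
```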