Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Authors: Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED.
Researcher Affiliation | Industry | Adam Stooke (Google, USA, astooke@google.com); Rohit Prabhavalkar (Google, USA); Khe Chai Sim (Google, USA); Pedro Moreno Mengibar (Google, USA)
Pseudocode | No | The entire model is written by its recurrence relations as: $h = f_{\mathrm{enc}}(x)$ (1); $g_i = f_{\mathrm{pred}}(g_{i-1}, y_{i-1}),\ i \le U$ (2); $P(y_i \mid x, y_{<i}) = f_{\mathrm{joint}}(h_i, g_i),\ i \le U$ (3). The encoder and the decoder, which includes both the prediction and joint networks, are parameterized and learned together in an end-to-end manner, with total parameters $\theta$. We maximize the log probabilities of the correct labels, resulting in the familiar cross-entropy loss: $\mathcal{L}_{\mathrm{Aligner}}(\theta) = -\sum_{i=1}^{U} \log P(y_i \mid x, y_{<i};\, \theta)$ (4). (See the code sketch after this table.)
Open Source Code | No | Answer: [No] Justification: We conducted experiments on two proprietary datasets, which cannot be released; to balance this, we also conducted experiments on open-source data which is already fully available. We are also unable to release code; however, as discussed under reproducibility, we share full details of the implementation settings, and given the simpler nature of our model, it requires no coding tricks relative to previous models.
Open Datasets | Yes | We experiment on three U.S. English datasets with very different characteristics. The first is LibriSpeech 960-hour (LS) [42].
Dataset Splits | Yes | Table 3: WER (%) on LibriSpeech. Columns: DEV, TEST-CLEAN, TEST-OTHER.
Hardware Specification | No | While the exact numbers will depend on many hardware and implementation details, we present a representative case using our 100M-parameter-encoder LibriSpeech models in Table 5.
Software Dependencies | No | Table 6: LibriSpeech common training settings.
Experiment Setup | Yes | Table 1: Settings used with each dataset: LibriSpeech, Voice Search, and YouTube.
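
For concreteness, the recurrence relations (1)-(4) quoted under Pseudocode can be sketched in code. The following is a minimal illustration only, not the authors' implementation: PyTorch is assumed as the framework, and the choice of a Transformer encoder for f_enc, an LSTM for f_pred, a two-layer feed-forward joint for f_joint, and all names and sizes (AlignerEncoderSketch, aligner_loss, d_model, and so on) are hypothetical.

```python
# Minimal sketch of equations (1)-(4); illustrative only, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignerEncoderSketch(nn.Module):
    """Hypothetical aligner-encoder model: encoder, prediction net, joint net."""

    def __init__(self, feat_dim=80, d_model=256, vocab_size=1024):
        super().__init__()
        # (1) h = f_enc(x): self-attention encoder over acoustic frames.
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # (2) g_i = f_pred(g_{i-1}, y_{i-1}): label-recurrent prediction network.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pred = nn.LSTM(d_model, d_model, batch_first=True)
        # (3) f_joint(h_i, g_i): combine encoder frame i with prediction state i.
        self.joint = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.Tanh(), nn.Linear(d_model, vocab_size)
        )

    def forward(self, x, y_prev):
        # x: (B, T, feat_dim) acoustic features; y_prev: (B, U) right-shifted labels.
        h = self.encoder(self.input_proj(x))   # (B, T, d_model), eq. (1)
        g, _ = self.pred(self.embed(y_prev))   # (B, U, d_model), eq. (2)
        U = y_prev.size(1)
        # Aligner-encoder premise: the encoder has already aligned the audio,
        # so label i reads encoder frame i directly (assumes T >= U here).
        logits = self.joint(torch.cat([h[:, :U, :], g], dim=-1))  # eq. (3)
        return logits                          # (B, U, vocab_size)


def aligner_loss(logits, y_true):
    # (4) L_Aligner = -sum_i log P(y_i | x, y_<i): ordinary cross-entropy.
    return F.cross_entropy(logits.transpose(1, 2), y_true, reduction="sum")


# Hypothetical usage with random tensors (batch 2, 50 frames, 10 labels):
model = AlignerEncoderSketch()
x = torch.randn(2, 50, 80)
y_prev = torch.randint(0, 1024, (2, 10))   # previous labels, shifted right
y_true = torch.randint(0, 1024, (2, 10))   # target labels
loss = aligner_loss(model(x, y_prev), y_true)
```

The design choice the sketch highlights is that label i reads only encoder frame h_i, reflecting the paper's premise that the self-attention encoder performs the audio-to-text alignment internally; training then reduces to the ordinary cross-entropy of equation (4) rather than RNN-T's marginalization over a T x U alignment lattice.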