Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Authors: Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED.
Researcher Affiliation | Industry | Adam Stooke (Google, USA, astooke@google.com); Rohit Prabhavalkar (Google, USA); Khe Chai Sim (Google, USA); Pedro Moreno Mengibar (Google, USA)
Pseudocode | No | The entire model is written by its recurrence relations as: $h = f_{\mathrm{enc}}(x)$ (1); $g_i = f_{\mathrm{pred}}(g_{i-1}, y_{i-1}),\ i \le U$ (2); $P(y_i \mid x, y_{<i}) = f_{\mathrm{joint}}(h_i, g_i),\ i \le U$ (3). The encoder and the decoder, which includes both the prediction and joint networks, are parameterized and learned together in an end-to-end manner, with total parameters $\theta$. We maximize the log probabilities of the correct labels, resulting in the familiar cross-entropy loss: $\mathcal{L}_{\mathrm{Aligner}}(\theta) = -\sum_{i=1}^{U} \log P(y_i \mid x, y_{<i};\, \theta)$ (4). (See the code sketch after this table.)
Open Source Code | No | Answer: [No] Justification: We conducted experiments on two proprietary datasets, which cannot be released; to balance this, we also conducted experiments on open-source data which is already fully available. We are also unable to release code; however, as discussed under reproducibility, we share full details of the implementation settings, and given the simpler nature of our model, it requires no coding tricks relative to previous models.
Open Datasets | Yes | We experiment on three U.S. English datasets with very different characteristics. The first is LibriSpeech 960-hour (LS) [42].
Dataset Splits | Yes | Table 3: WER (%) on LibriSpeech. Columns: DEV, TEST-CLEAN, TEST-OTHER.
Hardware Specification | No | While the exact numbers will depend on many hardware and implementation details, we present a representative case using our 100M-parameter-encoder LibriSpeech models in Table 5.
Software Dependencies | No | Table 6: LibriSpeech common training settings.
Experiment Setup | Yes | Table 1: Settings used with each dataset: LibriSpeech, Voice Search, and YouTube.
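
For concreteness, the recurrence relations (1)-(4) quoted under Pseudocode can be sketched in code. The following is a minimal illustration only, not the authors' implementation: PyTorch is assumed as the framework, and the choice of a Transformer encoder for f_enc, an LSTM for f_pred, a two-layer feed-forward joint for f_joint, and all names and sizes (AlignerEncoderSketch, aligner_loss, d_model, and so on) are hypothetical.

```python
# Minimal sketch of equations (1)-(4); illustrative only, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignerEncoderSketch(nn.Module):
    """Hypothetical aligner-encoder model: encoder, prediction net, joint net."""

    def __init__(self, feat_dim=80, d_model=256, vocab_size=1024):
        super().__init__()
        # (1) h = f_enc(x): self-attention encoder over acoustic frames.
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # (2) g_i = f_pred(g_{i-1}, y_{i-1}): label-recurrent prediction network.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pred = nn.LSTM(d_model, d_model, batch_first=True)
        # (3) f_joint(h_i, g_i): combine encoder frame i with prediction state i.
        self.joint = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.Tanh(), nn.Linear(d_model, vocab_size)
        )

    def forward(self, x, y_prev):
        # x: (B, T, feat_dim) acoustic features; y_prev: (B, U) right-shifted labels.
        h = self.encoder(self.input_proj(x))   # (B, T, d_model), eq. (1)
        g, _ = self.pred(self.embed(y_prev))   # (B, U, d_model), eq. (2)
        U = y_prev.size(1)
        # Aligner-encoder premise: the encoder has already aligned the audio,
        # so label i reads encoder frame i directly (assumes T >= U here).
        logits = self.joint(torch.cat([h[:, :U, :], g], dim=-1))  # eq. (3)
        return logits                          # (B, U, vocab_size)


def aligner_loss(logits, y_true):
    # (4) L_Aligner = -sum_i log P(y_i | x, y_<i): ordinary cross-entropy.
    return F.cross_entropy(logits.transpose(1, 2), y_true, reduction="sum")


# Hypothetical usage with random tensors (batch 2, 50 frames, 10 labels):
model = AlignerEncoderSketch()
x = torch.randn(2, 50, 80)
y_prev = torch.randint(0, 1024, (2, 10))   # previous labels, shifted right
y_true = torch.randint(0, 1024, (2, 10))   # target labels
loss = aligner_loss(model(x, y_prev), y_true)
```

The design choice the sketch highlights is that label i reads only encoder frame h_i, reflecting the paper's premise that the self-attention encoder performs the audio-to-text alignment internally; training then reduces to the ordinary cross-entropy of equation (4) rather than RNN-T's marginalization over a T x U alignment lattice.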