Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers
Authors: Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. |
| Researcher Affiliation | Industry | Adam Stooke (Google, USA, astooke@google.com), Rohit Prabhavalkar (Google, USA), Khe Chai Sim (Google, USA), Pedro Moreno Mengibar |
| Pseudocode | No | The entire model is written by its recurrence relations as: $h = f_{\text{enc}}(x)$ (1); $g_i = f_{\text{pred}}(g_{i-1}, y_{i-1}),\ i \le U$ (2); $P(y_i \mid x, y_{<i}) = f_{\text{joint}}(h_i, g_i),\ i \le U$ (3). The encoder and the decoder (which includes both the prediction and joint networks) are parameterized and learned together in an end-to-end manner, with total parameters $\theta$. We maximize the log probabilities of the correct labels, resulting in the familiar cross-entropy loss: $\mathcal{L}_{\text{Aligner}}(\theta) = -\sum_{i=1}^{U} \log P(y_i \mid x, y_{<i}; \theta)$ (4). (A minimal loss sketch is given after this table.) |
| Open Source Code | No | Answer: [No] Justification: We conducted experiments on two proprietary datasets, which cannot be released, and to balance this we also conducted experiments on open-source data which is already fully available. We are also unable to release code; however, as discussed under reproducibility, we share full details of the implementation settings, and given the simpler nature of our model, it requires no coding tricks relative to previous models. |
| Open Datasets | Yes | We experiment on three U.S. English datasets with very different characteristics. The first is LibriSpeech 960-hour (LS) [42]. |
| Dataset Splits | Yes | Table 3: WER (%) on LibriSpeech (dev, test-clean, and test-other splits). |
| Hardware Specification | No | While the exact numbers will depend on many hardware and implementation details, we present a representative case using our 100M-parameter encoder LibriSpeech models in Table 5. |
| Software Dependencies | No | Table 6: LibriSpeech common training settings. |
| Experiment Setup | Yes | Table 1: Settings used with each dataset: LibriSpeech, Voice Search, and YouTube. |
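For the Pseudocode row, the excerpt above only states the recurrence relations and loss in Eqs. (1)-(4); the paper releases no code. The following is a minimal NumPy sketch of that loss, not the authors' implementation: it assumes the aligner-encoder has already aligned its outputs so that frame $i$ pairs with label position $i$, and it stands in a single linear projection of the concatenated features for the joint network (`joint_w` is a hypothetical parameter, and the concatenation-plus-linear combine is an assumption, not taken from the paper).

```python
import numpy as np

def aligner_cross_entropy_loss(h, g, joint_w, labels):
    """Sketch of the Aligner loss in Eqs. (1)-(4), under stated assumptions.

    h:       (U, D) encoder outputs, assumed already aligned so that
             frame i corresponds to label position i.
    g:       (U, D) prediction-network outputs g_i = f_pred(g_{i-1}, y_{i-1}).
    joint_w: (2*D, V) hypothetical joint-network weights mapping the
             concatenated (h_i, g_i) to vocabulary logits.
    labels:  (U,) integer targets y_1..y_U.
    """
    # Joint network stand-in: concatenate encoder and prediction features,
    # project to vocabulary logits.
    logits = np.concatenate([h, g], axis=-1) @ joint_w            # (U, V)
    # Log-softmax with max subtraction for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # L_Aligner(theta) = -sum_i log P(y_i | x, y_<i; theta)   (Eq. 4)
    return -log_probs[np.arange(len(labels)), labels].sum()
```

Because every label position has exactly one paired encoder frame, this reduces to ordinary per-position cross-entropy rather than an RNN-T lattice sum, which is the simplification the paper emphasizes.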