Efficient Sequence Transduction by Jointly Predicting Tokens and Durations
Authors: Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, Boris Ginsburg
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model in three different tasks: speech recognition, speech translation, and spoken language understanding. We use the NeMo (Kuchaiev et al., 2019) toolkit for all experiments... TDT models achieve both better accuracy and significantly faster inference than conventional Transducers on different sequence transduction tasks. |
| Researcher Affiliation | Collaboration | 1NVIDIA, USA 2Carnegie Mellon University, PA, USA. |
| Pseudocode | Yes | Algorithm 1 Greedy Inference of Conventional Transducer... Algorithm 2 Greedy Inference of TDT Models |
| Open Source Code | Yes | Our implementation of the TDT model will be open-sourced with the NeMo (https://github.com/NVIDIA/NeMo) toolkit. |
| Open Datasets | Yes | Our English ASR models are trained on the Librispeech (Panayotov et al., 2015) set with 960 hours of speech. Speed perturbation with factors (0.9, 1.0, 1.1) is performed to augment the dataset... Our Spanish models are trained on a combination of Mozilla Common Voice (MCV) (Ardila et al., 2019), Multilingual Librispeech (MLS) (Pratap et al., 2020), Voxpopuli (Wang et al., 2021a), and Fisher (LDC2010S01) datasets with 1340 hours in total... The German ASR model was trained on the MCV, MLS, and Voxpopuli datasets, with a total of around 2000 hours. |
| Dataset Splits | Yes | For all experiments, we train our models for no more than 200 epochs, and perform checkpoint averaging over the 5 checkpoints with the best performance on validation data to generate the model for evaluation. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running experiments were mentioned. |
| Software Dependencies | No | The paper mentions using the NeMo toolkit, PyTorch (Paszke et al., 2019), and the Adam optimizer (Kingma & Ba, 2015), but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Unless specified otherwise, we use Conformer Large for all tasks. For acoustic feature extraction, we use audio frames of 10 ms and window sizes of 25 ms. Our model has a conformer encoder with 17 layers, num_heads = 8, and relative position embeddings. The hidden dimension of all the conformer layers is set to 512, and for the feed-forward layers in the conformer, an expansion factor of 4 is used. The convolution layers use a kernel size of 31. At the beginning of the encoder, convolution-based subsampling is performed with subsampling rate 4. All models have around 120M parameters... We use different subword-based tokenizers for different models... Unless specified otherwise, logit under-normalization is used during training with σ = 0.05. For all experiments, we train our models for no more than 200 epochs, and perform checkpoint averaging over the 5 checkpoints with the best performance on validation data to generate the model for evaluation. We run non-batched greedy search inference for all evaluations reported in this Section. No external LM is used in any of our experiments. |
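
The Pseudocode row above cites Algorithm 2, the paper's greedy inference for TDT models. Below is a minimal Python sketch of that idea: at each step the joint network predicts a token and a duration, and the time index advances by the predicted duration rather than one frame at a time. The `encoder_out`, `predictor`, and `joint` interfaces, the duration set, and the termination safeguards are assumptions for illustration, not NeMo's actual API or the paper's exact algorithm.

```python
import torch

@torch.no_grad()
def tdt_greedy_decode(encoder_out, predictor, joint, blank_id,
                      durations=(0, 1, 2, 3, 4), max_symbols_per_frame=10):
    """Greedy TDT inference sketch. encoder_out: (T, D) encoder frames for one utterance."""
    hyp, state = [], predictor.initial_state()    # assumed helper on the prediction network
    t, symbols_here = 0, 0
    T = encoder_out.size(0)
    while t < T:
        # The joint network has two heads: token logits (incl. blank) and duration logits.
        token_logits, dur_logits = joint(encoder_out[t], state)
        token = int(token_logits.argmax())
        dur = durations[int(dur_logits.argmax())]
        if token != blank_id:
            hyp.append(token)
            state = predictor.step(token, state)  # assumed helper: advance prediction network
            symbols_here += 1
        # Advance by the predicted duration; force at least one frame of progress on blank
        # (or after too many emissions at the same frame) so the loop terminates.
        if token == blank_id or symbols_here >= max_symbols_per_frame:
            dur = max(dur, 1)
        if dur > 0:
            t += dur
            symbols_here = 0
    return hyp
```

Compared with a conventional Transducer, whose time index only moves by one frame per blank, the duration head lets the decoder skip ahead, which is where the reported inference speedup comes from.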
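The Open Datasets row mentions speed perturbation with factors (0.9, 1.0, 1.1). The sketch below shows one way such augmentation could be applied with torchaudio's sox effects; this is an assumption about the mechanism, not NeMo's actual augmentation pipeline, and `apply_effects_tensor` requires a torchaudio build with sox support.

```python
import random
import torch
import torchaudio

def speed_perturb(waveform: torch.Tensor, sample_rate: int,
                  factors=(0.9, 1.0, 1.1)) -> torch.Tensor:
    """Randomly resample an utterance by one of the given speed factors."""
    factor = random.choice(factors)
    if factor == 1.0:
        return waveform
    # "speed" changes tempo and pitch; "rate" resamples back to the original sample rate.
    effects = [["speed", f"{factor}"], ["rate", f"{sample_rate}"]]
    perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)
    return perturbed
```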
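The training recipe averages the 5 checkpoints with the best validation performance before evaluation. A minimal checkpoint-averaging sketch is shown below, assuming plain PyTorch checkpoints that store the model weights under a "state_dict" key; the key layout and paths are illustrative rather than the paper's actual setup.

```python
import torch

def average_checkpoints(paths):
    """Return the element-wise mean of the state dicts stored at `paths`."""
    avg = None
    for p in paths:
        sd = torch.load(p, map_location="cpu")["state_dict"]  # assumed checkpoint layout
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k in avg:
                avg[k] += sd[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g., average the 5 checkpoints with the best validation score:
# model.load_state_dict(average_checkpoints(best5_paths))
```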
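For reference, the encoder and training hyperparameters quoted in the Experiment Setup row can be collected into a single illustrative configuration. The key names below are ours, not NeMo's configuration schema; only the values come from the paper.

```python
# Conformer Large encoder, as described in the Experiment Setup row.
conformer_large_cfg = {
    "num_layers": 17,
    "d_model": 512,
    "num_heads": 8,
    "ff_expansion_factor": 4,
    "conv_kernel_size": 31,
    "subsampling_rate": 4,        # convolution-based subsampling at the encoder input
    "pos_embedding": "relative",
    "feature_hop_ms": 10,         # acoustic frame shift
    "feature_window_ms": 25,      # analysis window size
}

# Training and evaluation settings quoted in the same row.
training_cfg = {
    "optimizer": "Adam",
    "max_epochs": 200,
    "checkpoint_averaging": 5,    # 5 best checkpoints on validation data
    "logit_under_normalization_sigma": 0.05,
    "decoding": "non-batched greedy search, no external LM",
}
```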