Conditioning Sequence-to-sequence Networks with Learned Activations
Authors: Alberto Gil Couto Pimentel Ramos, Abhinav Mehrotra, Nicholas Donald Lane, Sourav Bhattacharya
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS PART I: PSE FOR TELEPHONY AND PRE-TRAINED ASR. We evaluate LA on three PSE model families (Section 4.2), on two datasets, Librispeech (Panayotov et al., 2015) and Voxforge (vox, 2006). We show that the use of LA on all models achieves competitive performance to conditioning using concatenation and modulation approaches. |
| Researcher Affiliation | Collaboration | ¹Samsung AI Centre, Cambridge, UK ²University of Cambridge, UK {a.gilramos,a.mehrotra1,nic.lane,sourav.b1}@samsung.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We consider two representative datasets: LibriSpeech (Panayotov et al., 2015) and Voxforge (vox, 2006). VoxForge Corpus. http://www.voxforge.org, 2006. |
| Dataset Splits | No | For the English training set, we take 100h and 360h of clean speech from LibriSpeech. We create a training dataset by taking 32.5h of Spanish audio from Voxforge. We augment each sample tuple (audio, transcript) to include two seconds of enrollment audio from the same user (audio, transcript, enrollment). We ensure the enrollment audio is taken from a different sample from the same user. For diversity during training, the enrollment audio is selected randomly under the aforementioned constraint, whereas for evaluation it is always the same. |
| Hardware Specification | No | Models are trained in a data-parallel way with four GPUs using TensorFlow and Horovod. |
| Software Dependencies | No | Models are trained in a data-parallel way with four GPUs using TensorFlow and Horovod. We use open-source Silero (sil, 2021) ASR models (English and Spanish) in our evaluations. |
| Experiment Setup | Yes | We apply Short Time Fourier Transform (STFT) to extract 512 coefficients using a 32 ms window and a stride of 16 ms. Models are trained in a data-parallel way with four GPUs using TensorFlow and Horovod. We use a batch size of 64, learning rates in [1e-5, 1e-3], an exponential-decay learning rate scheduler, and early stopping (max epochs 100). For character and subword models we consider greedy (beam search with top paths and beam width equal to one) and beam search (with a beam width of four) CTC decoders. |
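
The "Dataset Splits" row above describes how each (audio, transcript) pair is augmented with a two-second enrollment clip taken from a different utterance of the same speaker, drawn randomly during training and kept fixed for evaluation. The sketch below illustrates that pairing; the function name `build_triplets` and the input layout (a dict mapping users to lists of (audio, transcript) samples) are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the enrollment-audio pairing described in the
# "Dataset Splits" row. All names here are hypothetical placeholders.
import random

ENROLLMENT_SECONDS = 2.0  # two seconds of enrollment audio, per the paper


def build_triplets(samples_by_user, sample_rate=16000, training=True):
    """Turn (audio, transcript) pairs into (audio, transcript, enrollment) triplets.

    Enrollment audio comes from a *different* sample of the same user; it is
    drawn randomly during training and kept fixed (first candidate) for evaluation.
    """
    enroll_len = int(ENROLLMENT_SECONDS * sample_rate)
    triplets = []
    for user, samples in samples_by_user.items():
        for i, (audio, transcript) in enumerate(samples):
            # Candidate enrollment utterances: any other sample from the same user.
            candidates = [s for j, s in enumerate(samples) if j != i]
            if not candidates:
                continue
            enroll_audio, _ = random.choice(candidates) if training else candidates[0]
            triplets.append((audio, transcript, enroll_audio[:enroll_len]))
    return triplets
```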
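
The "Hardware Specification", "Software Dependencies" and "Experiment Setup" rows together describe the training configuration: TensorFlow with Horovod data parallelism over four GPUs, a batch size of 64, an exponential-decay learning rate in [1e-5, 1e-3], and early stopping with at most 100 epochs. Below is a minimal Keras/Horovod sketch of that setup; the stand-in model, dummy data, decay constants, and patience value are placeholders, not values reported in the paper.

```python
# Sketch of the data-parallel training setup; launch with e.g.
# `horovodrun -np 4 python train.py` for the four-GPU configuration.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Stand-in model and data; the real PSE models and STFT features are not shown here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(512,)),
    tf.keras.layers.Dense(512),
])
x = np.random.randn(1024, 512).astype('float32')
train = tf.data.Dataset.from_tensor_slices((x, x)).batch(64)  # batch size 64 (paper)

# Exponential-decay learning rate; the paper reports LRs in [1e-5, 1e-3],
# but decay_steps and decay_rate here are assumptions.
lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.96)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(lr))
model.compile(optimizer=opt, loss='mse')  # placeholder loss

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),             # sync initial weights across workers
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5),  # early stopping; patience is an assumption
]
model.fit(train, epochs=100, callbacks=callbacks,                  # max epochs 100 (paper)
          verbose=1 if hvd.rank() == 0 else 0)
```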
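
The "Experiment Setup" row also states that character and subword models are decoded with a greedy CTC decoder (equivalent to beam search with top paths and beam width equal to one) and with beam search of width four. A small TensorFlow sketch of those two decoder configurations follows; the logits shape and alphabet size are illustrative assumptions.

```python
# Sketch of the two CTC decoding modes named in the "Experiment Setup" row.
import tensorflow as tf

batch_size, max_time, num_classes = 2, 50, 29  # assumed alphabet size incl. blank
logits = tf.random.normal([max_time, batch_size, num_classes])  # time-major logits
seq_len = tf.fill([batch_size], max_time)

# "Greedy" decoding: beam search with beam_width=1 and top_paths=1.
greedy_decoded, _ = tf.nn.ctc_beam_search_decoder(
    logits, seq_len, beam_width=1, top_paths=1)

# Beam-search decoding with a beam width of four.
beam_decoded, _ = tf.nn.ctc_beam_search_decoder(
    logits, seq_len, beam_width=4, top_paths=1)

print(tf.sparse.to_dense(greedy_decoded[0]).numpy())
print(tf.sparse.to_dense(beam_decoded[0]).numpy())
```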