Conditioning Sequence-to-sequence Networks with Learned Activations

Authors: Alberto Gil Couto Pimentel Ramos, Abhinav Mehrotra, Nicholas Donald Lane, Sourav Bhattacharya

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTS PART I: PSE FOR TELEPHONY AND PRE-TRAINED ASR. We evaluate LA on three PSE model families (§4.2) on two datasets, LibriSpeech (Panayotov et al., 2015) and VoxForge (vox, 2006). We show that the use of LA on all models achieves performance competitive with conditioning using concatenation and modulation approaches.
Researcher Affiliation | Collaboration | (1) Samsung AI Centre, Cambridge, UK; (2) University of Cambridge, UK. {a.gilramos,a.mehrotra1,nic.lane,sourav.b1}@samsung.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We consider two representative datasets: LibriSpeech (Panayotov et al., 2015) and VoxForge (vox, 2006). VoxForge Corpus. http://www.voxforge.org, 2006.
Dataset Splits | No | For the English training set, we take 100h and 360h of clean speech from LibriSpeech. We create a training dataset by taking 32.5h of Spanish audio from VoxForge. We augment each sample tuple (audio, transcript) to include two seconds of enrollment audio from the same user: (audio, transcript, enrollment). We ensure the enrollment audio is taken from a different sample from the same user. For diversity during training, the enrollment audio is selected randomly under the aforementioned constraint, whereas for evaluation it is always the same.
Hardware Specification | No | Models are trained in a data-parallel way with four GPUs using TensorFlow and Horovod.
Software Dependencies | No | Models are trained in a data-parallel way with four GPUs using TensorFlow and Horovod. We use open-source Silero (sil, 2021) ASR models (English and Spanish) in our evaluations.
Experiment Setup | Yes | We apply the Short-Time Fourier Transform (STFT) to extract 512 coefficients using a 32 ms window and a stride of 16 ms. Models are trained in a data-parallel way with four GPUs using TensorFlow and Horovod. We use a batch size of 64, learning rates in [1e-5, 1e-3], an exponential-decay learning-rate scheduler, and early stopping (max 100 epochs). For character and subword models we consider greedy (beam search with top paths and beam width equal to one) and beam search (with a beam width of four) CTC decoders.
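
The Research Type row quotes the paper's evaluation of learned-activation (LA) conditioning for PSE models against concatenation and modulation baselines. Below is a minimal TensorFlow sketch of the general idea of conditioning a hidden layer through its activation function; the specific parameterization used here (a softmax mixture of four fixed basis nonlinearities, and the names `LearnedActivation`, `to_weights`) is assumed for illustration and is not taken from the paper.

```python
import tensorflow as tf

class LearnedActivation(tf.keras.layers.Layer):
    """Per-channel activation whose shape is predicted from a conditioning vector.

    Hypothetical parameterization: each channel's nonlinearity is a softmax-weighted
    mixture of four fixed basis functions (identity, tanh, sigmoid, relu), with the
    mixture weights produced from the enrollment/speaker embedding by a small dense
    head. This illustrates conditioning through activations in general; it is not
    the exact formulation used in the paper.
    """

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.channels = channels
        self.num_bases = 4
        # Maps the conditioning embedding to one mixing weight per (channel, basis).
        self.to_weights = tf.keras.layers.Dense(channels * self.num_bases)

    def call(self, x, cond):
        # x:    (batch, time, channels) hidden activations of the enhancement model
        # cond: (batch, cond_dim)       enrollment/speaker embedding
        w = self.to_weights(cond)                                   # (batch, channels * 4)
        w = tf.reshape(w, [-1, 1, self.channels, self.num_bases])   # broadcast over time
        w = tf.nn.softmax(w, axis=-1)                               # normalize per channel
        bases = tf.stack(
            [x, tf.nn.tanh(x), tf.nn.sigmoid(x), tf.nn.relu(x)], axis=-1
        )                                                           # (batch, time, channels, 4)
        return tf.reduce_sum(w * bases, axis=-1)                    # (batch, time, channels)
```

For contrast with the baselines named in the excerpt: concatenation would tile `cond` along the time axis and append it to `x` before the next layer, while FiLM-style modulation would predict a per-channel scale and shift from `cond` and apply them to `x`.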
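
The Dataset Splits row describes expanding each (audio, transcript) pair with a two-second enrollment clip taken from a different utterance of the same speaker, chosen randomly during training and held fixed for evaluation. A minimal sketch of that pairing logic follows; the `samples_by_speaker` structure and `add_enrollment` helper are hypothetical names, and trimming the enrollment clip to two seconds is omitted.

```python
import random

def add_enrollment(samples_by_speaker, training, seed=0):
    """Expand (audio, transcript) pairs into (audio, transcript, enrollment) triples.

    `samples_by_speaker` maps speaker_id -> list of (audio, transcript) pairs.
    The enrollment clip is drawn from a *different* utterance of the same speaker:
    randomly during training, deterministically (always the same utterance) for
    evaluation, matching the protocol quoted above.
    """
    rng = random.Random(seed)
    triples = []
    for speaker, utterances in samples_by_speaker.items():
        for i, (audio, transcript) in enumerate(utterances):
            others = [u for j, u in enumerate(utterances) if j != i]
            if not others:
                continue  # a speaker with a single utterance cannot form a triple
            if training:
                enrol_audio, _ = rng.choice(others)
            else:
                enrol_audio, _ = others[0]  # fixed choice for evaluation
            triples.append((audio, transcript, enrol_audio))
    return triples
```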
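
The Hardware Specification and Software Dependencies rows only state that training is data-parallel over four GPUs with TensorFlow and Horovod. A minimal Horovod/Keras skeleton consistent with that description and the quoted hyperparameters (batch size 64, learning rate within [1e-5, 1e-3], exponential decay, early stopping, at most 100 epochs) is sketched below; the model, data, and decay constants are placeholders, and it assumes a Horovod build compatible with the installed TensorFlow version.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:  # pin each Horovod worker process to a single GPU
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Dummy spectrogram inputs/targets standing in for the PSE training data.
x = np.random.randn(256, 100, 512).astype('float32')
y = np.random.randn(256, 100, 512).astype('float32')

# Placeholder model standing in for the PSE architectures evaluated in the paper.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 512)),
    tf.keras.layers.Dense(512),
])

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=1000, decay_rate=0.96)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(schedule))
model.compile(optimizer=opt, loss='mse')

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),          # sync initial weights
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5),
]
model.fit(x, y, batch_size=64, epochs=100, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```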
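
The Experiment Setup row specifies a 32 ms window / 16 ms stride STFT front end and CTC decoding that is either greedy or uses a beam width of four. The sketch below assumes a 16 kHz sampling rate (giving 512-sample windows and 256-sample hops); how exactly 512 coefficients per frame are obtained is not stated in the excerpt, so the default FFT length and a magnitude spectrogram are used, and the decoder calls run on dummy logits.

```python
import tensorflow as tf

def stft_features(waveform, sample_rate=16000):
    """32 ms window / 16 ms stride STFT front end, per the quoted setup."""
    frame_length = int(0.032 * sample_rate)   # 512 samples at 16 kHz
    frame_step = int(0.016 * sample_rate)     # 256 samples at 16 kHz
    return tf.abs(tf.signal.stft(waveform, frame_length=frame_length,
                                 frame_step=frame_step))

waveform = tf.random.normal([16000 * 3])      # three seconds of dummy audio
features = stft_features(waveform)            # (frames, 257) magnitude spectrogram

# CTC decoding for the character/subword evaluation: "greedy" is beam search with
# beam width 1 and a single top path; the wider search uses beam width 4.
logits = tf.random.normal([50, 2, 30])        # (time, batch, num_classes) dummy scores
seq_len = tf.fill([2], 50)                    # full-length sequences
greedy_paths, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len,
                                                beam_width=1, top_paths=1)
beam_paths, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len,
                                               beam_width=4, top_paths=1)
```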