Understanding Adaptive, Multiscale Temporal Integration In Deep Speech Recognition Systems
Authors: Menoua Keshishian, Samuel Norman-Haignere, Nima Mesgarani
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We applied our method to understand how the popular DeepSpeech2 model learns to integrate across time in speech. We find that nearly all of the model units, even in recurrent layers, have a compact integration window within which stimuli substantially alter the response and outside of which stimuli have little effect. We show that training causes these integration windows to shrink at early layers and expand at higher layers, creating a hierarchy of integration windows across the network. |
| Researcher Affiliation | Academia | Menoua Keshishian, Department of Electrical Engineering, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY 10027, mk4011@columbia.edu; Sam V. Norman-Haignere, Department of Electrical Engineering, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY 10027, sn2776@columbia.edu; Nima Mesgarani, Department of Electrical Engineering, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY 10027, nima@ee.columbia.edu |
| Pseudocode | No | The paper describes the methods and analysis steps in prose and with mathematical equations, but does not include any blocks explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code available at: https://github.com/naplab/PyTCI |
| Open Datasets | Yes | Sound segments were excerpted from the LibriSpeech corpus dev-clean and test-clean sets. All models were implemented in PyTorch (Paszke et al., 2019) and trained using PyTorch Lightning (Falcon, 2019) on the training set of the LibriSpeech corpus (Panayotov et al., 2015). (See the loading sketch after the table.) |
| Dataset Splits | No | The paper mentions using the 'LibriSpeech test-clean set' for evaluation and the 'training set of the LibriSpeech corpus' for training, but does not specify a separate validation set split or its size/proportion. |
| Hardware Specification | Yes | Training and inference of all models were performed on NVIDIA A40 GPUs (one per training/inference) at the internal cluster at the Zuckerman Institute of Columbia University. |
| Software Dependencies | No | All models were implemented in PyTorch (Paszke et al., 2019) and trained using PyTorch Lightning (Falcon, 2019)... Augmentations were performed using the Sound eXchange (SoX) backend of the audio library for PyTorch (torchaudio). While PyTorch and PyTorch Lightning are cited, their specific versions are not explicitly stated in the text for replication. |
| Experiment Setup | Yes | We used the CTC loss (Graves et al., 2006), the Adam optimizer (learning rate: 1.5e-4, weight decay: 1e-5) (Kingma and Ba, 2014), and a batch size of 64; all models were trained for 20 epochs. (A hedged sketch of this setup follows the table.) |
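As a companion to the dataset rows above, the snippet below sketches how the LibriSpeech splits named in the paper can be fetched with torchaudio's built-in `LIBRISPEECH` dataset class. The paper does not state which LibriSpeech training subset was used, so `train-clean-100` and the `ROOT` path are assumptions for illustration only.

```python
import torchaudio

ROOT = "./data"  # placeholder download location (assumption)

# Splits named in the paper: a LibriSpeech training set for training,
# dev-clean and test-clean for excerpting the evaluation sound segments.
# "train-clean-100" is one possible training subset; the paper does not say which.
train_set = torchaudio.datasets.LIBRISPEECH(ROOT, url="train-clean-100", download=True)
dev_set = torchaudio.datasets.LIBRISPEECH(ROOT, url="dev-clean", download=True)
test_set = torchaudio.datasets.LIBRISPEECH(ROOT, url="test-clean", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = train_set[0]
print(waveform.shape, sample_rate, transcript)
```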
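The Experiment Setup row reports a complete optimization recipe: CTC loss, Adam with learning rate 1.5e-4 and weight decay 1e-5, batch size 64, and 20 training epochs. A minimal PyTorch sketch of that recipe follows. `TinyCTCModel`, the blank index, and the dummy batch shapes are stand-ins for illustration, not the paper's DeepSpeech2 implementation.

```python
import torch
import torch.nn as nn


class TinyCTCModel(nn.Module):
    """Hypothetical stand-in for the paper's DeepSpeech2 model: GRU + linear head."""

    def __init__(self, n_mels=80, hidden=256, n_classes=29):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):  # x: (N, T, n_mels)
        out, _ = self.rnn(x)
        # CTCLoss expects log-probabilities shaped (T, N, C).
        return self.fc(out).log_softmax(-1).transpose(0, 1)


model = TinyCTCModel()
criterion = nn.CTCLoss(blank=0)  # blank index is an assumption
# Reported hyperparameters: lr 1.5e-4, weight decay 1e-5.
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4, weight_decay=1e-5)

# One dummy batch at the reported batch size of 64; real training would
# iterate LibriSpeech batches for the reported 20 epochs.
inputs = torch.randn(64, 200, 80)                  # (N, T, n_mels) features
targets = torch.randint(1, 29, (64, 30))           # label sequences (blank=0 excluded)
input_lengths = torch.full((64,), 200)
target_lengths = torch.full((64,), 30)

log_probs = model(inputs)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```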