Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Authors: Santiago Cuervo, Adrian Łańcucki, Ricard Marxer, Paweł Rychlikowski, Jan K. Chorowski

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that the quality of learned representations can be improved by explicitly modeling the data at two different, non-uniform sampling rates, and with different training criteria applied at each level of data modeling. We demonstrate our results on speech, which is a hierarchical signal with a well understood structure. We extend a frame-wise feature extraction model with a boundary predictor, followed by another feature transformation applied to variable length contiguous segments of frames. We show that through a careful design of the second level training criterion we can improve on the quality of the learned representations and perform unit discovery by detecting boundaries which overlap with ground-truth segmentation of the data.
Researcher Affiliation | Collaboration | 1. University of Wrocław, Poland; 2. Université de Toulon, Aix Marseille Univ, CNRS, LIS, France; 3. NVIDIA, Poland; 4. Pathway, France
Pseudocode | No | The paper describes the model architecture and training objectives in detail but does not include any explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | We make our code available at https://github.com/chorowski-lab/hCPC.
Open Datasets | Yes | The experiments were performed on the train-clean-100 subset of LibriSpeech (Panayotov et al. [2015], CC BY 4.0) using the aligned phone labels provided by van den Oord et al. [2018].
Dataset Splits | Yes | The experiments were performed on the train-clean-100 subset of LibriSpeech (Panayotov et al. [2015], CC BY 4.0) using the aligned phone labels provided by van den Oord et al. [2018]... We compute the ABX score on the dev-clean set.
Hardware Specification | Yes | Training our model takes around 13 hours on 2x Nvidia RTX 3080 GPUs.
Software Dependencies | No | The paper mentions software components like "Adam optimizer" and types of networks like "LSTM network" and "transformer layer", but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions).
Experiment Setup | Yes | All models are trained using a batch size of 64, the Adam optimizer [Kingma and Ba, 2015] with a learning rate of 0.0002, and an initial warm-up phase where the learning rate is increased linearly during the first 10 epochs... We use N = 12 and M = 2 predictions, and 128 and 1 negative samples, for the CPC losses at the low- and high-level models respectively... The target quantizer q uses a codebook of 512 embeddings, each of dimension 256, and we use the standard literature value of λ = 0.25 in equation 4. The boundary predictor is a single-layer bidirectional transformer network with 8 scaled dot-product attention heads with internal dimension 2048. We set the expected average unit duration l = 8 in equation 7 to roughly match the average phone duration measured on the 100-hour subset of the LibriSpeech dataset (7.58 frames).
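The model summarized in the Research Type row combines a frame-level encoder, a boundary predictor, and a second encoder applied to variable-length segments of frames. The following is a minimal sketch of that two-level layout, assuming PyTorch, mean-pooling of each segment, and a 0.5 threshold on the boundary probabilities; module names, sizes, and the LSTM/linear choices are illustrative and not taken from the authors' released code:

import torch
import torch.nn as nn


class TwoLevelEncoder(nn.Module):
    """Sketch of a two-level (frame-rate / segment-rate) encoder.

    A low-level model produces features at a fixed frame rate, a boundary
    predictor marks segment ends, each contiguous run of frames is pooled
    into a single vector, and a high-level model re-encodes the resulting
    variable-rate sequence. Sizes and module choices are illustrative only.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.frame_encoder = nn.LSTM(dim, dim, batch_first=True)    # low level, fixed rate
        self.boundary_predictor = nn.Linear(dim, 1)                 # per-frame boundary logit
        self.segment_encoder = nn.LSTM(dim, dim, batch_first=True)  # high level, variable rate

    def forward(self, frames: torch.Tensor):
        # frames: (1, time, dim) frame-level features, e.g. the output of a CPC encoder
        low, _ = self.frame_encoder(frames)
        is_boundary = torch.sigmoid(self.boundary_predictor(low)).squeeze(-1) > 0.5

        # Mean-pool every contiguous segment delimited by predicted boundaries.
        segments, start = [], 0
        for t in range(low.size(1)):
            if is_boundary[0, t] or t == low.size(1) - 1:
                segments.append(low[:, start:t + 1].mean(dim=1))
                start = t + 1
        pooled = torch.stack(segments, dim=1)  # (1, n_segments, dim)

        high, _ = self.segment_encoder(pooled)
        return low, is_boundary, high


features = torch.randn(1, 100, 256)  # 100 frames of 256-dim features
low, boundaries, high = TwoLevelEncoder()(features)

In the paper both levels are trained with CPC-style losses (with the N = 12 and M = 2 prediction steps quoted above), which this sketch omits.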
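For the Open Datasets and Dataset Splits rows, the named LibriSpeech subsets are publicly available; one convenient, though not paper-prescribed, way to fetch train-clean-100 and dev-clean is torchaudio's built-in dataset class:

import torchaudio

# Download the LibriSpeech subsets named in the paper. torchaudio is just one
# possible loader; the paper does not mandate a specific library.
train = torchaudio.datasets.LIBRISPEECH("data/", url="train-clean-100", download=True)
dev = torchaudio.datasets.LIBRISPEECH("data/", url="dev-clean", download=True)

waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = train[0]
print(waveform.shape, sample_rate, transcript[:40])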
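The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. The sketch below records only the numeric values stated in the paper; the dataclass layout, the placeholder model, and the use of PyTorch's LinearLR scheduler to realize the warm-up are assumptions for illustration:

from dataclasses import dataclass

import torch


@dataclass
class HCPCConfig:
    # Numeric values are those quoted in the Experiment Setup row.
    batch_size: int = 64
    learning_rate: float = 2e-4
    warmup_epochs: int = 10            # linear learning-rate warm-up
    low_level_predictions: int = 12    # N: CPC prediction steps, low-level model
    high_level_predictions: int = 2    # M: CPC prediction steps, high-level model
    low_level_negatives: int = 128
    high_level_negatives: int = 1
    codebook_size: int = 512           # target quantizer entries
    codebook_dim: int = 256
    commitment_weight: float = 0.25    # lambda in the quantization loss (eq. 4)
    boundary_attention_heads: int = 8  # single-layer bidirectional transformer
    boundary_ffn_dim: int = 2048
    expected_unit_length: int = 8      # l in eq. 7, approx. average phone duration


cfg = HCPCConfig()
model = torch.nn.Linear(cfg.codebook_dim, cfg.codebook_dim)  # stand-in for the real model

optimizer = torch.optim.Adam(model.parameters(), lr=cfg.learning_rate)
# One way to realize the linear warm-up over the first 10 epochs (assumed scheduler).
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=cfg.warmup_epochs
)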