Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
Authors: Santiago Cuervo, Adrian Łańcucki, Ricard Marxer, Paweł Rychlikowski, Jan K. Chorowski
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that the quality of learned representations can be improved by explicitly modeling the data at two different, non-uniform sampling rates, and with different training criteria applied at each level of data modeling. We demonstrate our results on speech, which is a hierarchical signal with a well understood structure. We extend a frame-wise feature extraction model with a boundary predictor, followed by another feature transformation applied to variable length contiguous segments of frames. We show that through a careful design of the second level training criterion we can improve on the quality of the learned representations and perform unit discovery by detecting boundaries which overlap with ground-truth segmentation of the data. (See the architecture sketch below this table.) |
| Researcher Affiliation | Collaboration | (1) University of Wrocław, Poland; (2) Université de Toulon, Aix Marseille Univ, CNRS, LIS, France; (3) NVIDIA, Poland; (4) Pathway, France |
| Pseudocode | No | The paper describes the model architecture and training objectives in detail but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | We make our code available at https://github.com/chorowski-lab/hCPC. |
| Open Datasets | Yes | The experiments were performed on the train-clean-100 subset of LibriSpeech (Panayotov et al. [2015], CC BY 4.0) using the aligned phone labels provided by van den Oord et al. [2018]. |
| Dataset Splits | Yes | The experiments were performed on the train-clean-100 subset of LibriSpeech (Panayotov et al. [2015], CC BY 4.0) using the aligned phone labels provided by van den Oord et al. [2018]... We compute the ABX score on the dev-clean set. |
| Hardware Specification | Yes | Training our model takes around 13 hours on 2x Nvidia RTX 3080 GPUs. |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer" and types of networks like "LSTM network" and "transformer layer", but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | All models are trained using a batch size of 64, the Adam optimizer [Kingma and Ba, 2015] with a learning rate of 0.0002, and an initial warm-up phase where the learning rate is increased linearly during the first 10 epochs... We use N = 12 and M = 2 predictions, and 128 and 1 negative samples, for the CPC losses at low and high level models respectively... The target quantizer q uses a codebook of 512 embeddings each of dimension 256, and we use the standard literature value of λ = 0.25 in equation 4. The boundary predictor is a single layer bidirectional transformer network with 8 scaled dot-product attention heads with internal dimension 2048. We set the expected average unit duration l = 8 in equation 7 to roughly match the average phone duration measured on the 100 hour subset of the LibriSpeech dataset (7.58 frames). (These values are collected in the configuration sketch below this table.) |
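
The Research Type row above describes a two-level model: a frame-wise feature extractor, a boundary predictor, and a second feature transformation applied to variable-length contiguous segments of frames. The following is a minimal PyTorch sketch of that data flow only; the module choices (LSTM encoders, a linear boundary head, mean-pooling within segments, a hard 0.5 threshold) are illustrative assumptions and do not reproduce the authors' hCPC implementation, which is available in the linked repository.

```python
import torch
import torch.nn as nn


class TwoLevelSketch(nn.Module):
    """Minimal sketch of the frame-level -> segment-level data flow.

    The layer choices here are assumptions for illustration; consult
    https://github.com/chorowski-lab/hCPC for the actual model.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.frame_encoder = nn.LSTM(dim, dim, batch_first=True)  # low-level, fixed-rate model
        self.boundary_head = nn.Linear(dim, 1)                     # per-frame boundary logits
        self.segment_model = nn.LSTM(dim, dim, batch_first=True)  # high-level, variable-rate model

    def forward(self, frames: torch.Tensor):
        # frames: (batch, T, dim) fixed-rate acoustic features
        z, _ = self.frame_encoder(frames)
        # Hard thresholding is for illustration only; training the real model
        # requires a learnable (e.g. relaxed/differentiable) boundary mechanism.
        is_boundary = torch.sigmoid(self.boundary_head(z)).squeeze(-1) > 0.5  # (batch, T)

        pooled_batch = []
        for b in range(z.size(0)):
            cuts = is_boundary[b].nonzero().flatten().tolist() + [z.size(1)]
            start, pooled = 0, []
            for end in cuts:
                if end > start:
                    # mean-pool each contiguous, variable-length segment of frames
                    pooled.append(z[b, start:end].mean(dim=0))
                    start = end
            pooled_batch.append(torch.stack(pooled))

        # pad the per-utterance segment sequences so the high-level model runs batched
        segments = nn.utils.rnn.pad_sequence(pooled_batch, batch_first=True)
        high, _ = self.segment_model(segments)  # segment-level representations
        return z, is_boundary, high


# Example: two utterances of 100 frames with 256-dimensional features
z, boundaries, high = TwoLevelSketch()(torch.randn(2, 100, 256))
```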
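
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration block. The key names below are illustrative and need not match the flags used in the chorowski-lab/hCPC code; the values are exactly those reported above.

```python
# Hyperparameters as quoted in the Experiment Setup row; key names are
# assumptions and may differ from those used in the authors' code.
TRAIN_CONFIG = {
    "batch_size": 64,
    "optimizer": "Adam",
    "learning_rate": 2e-4,
    "lr_warmup_epochs": 10,  # learning rate increased linearly over the first 10 epochs
    "cpc_low": {"num_predictions": 12, "num_negatives": 128},   # frame-level CPC loss
    "cpc_high": {"num_predictions": 2, "num_negatives": 1},     # segment-level CPC loss
    "target_quantizer": {
        "codebook_size": 512,
        "embedding_dim": 256,
        "commitment_cost": 0.25,  # lambda in equation 4
    },
    "boundary_predictor": {
        "type": "bidirectional transformer, single layer",
        "attention_heads": 8,
        "internal_dim": 2048,
    },
    "expected_unit_duration_frames": 8,  # l in equation 7; measured phone duration: 7.58 frames
}
```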