DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, Jim Glass
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units. ... Quantitatively, DinoSR surpasses the state-of-the-art in speech recognition with limited resources on LibriSpeech [15] and unsupervised acoustic unit discovery [16]. |
| Researcher Affiliation | Collaboration | MIT CSAIL; Meta AI; alexhliu@mit.edu |
| Pseudocode | Yes | A.3 Pseudo-code for DinoSR training; Algorithm 1: PyTorch pseudocode for DinoSR. (A rough, hedged training-loop sketch is given below the table.) |
| Open Source Code | Yes | Code available at https://github.com/Alexander-H-Liu/dinosr. |
| Open Datasets | Yes | Following Hsu et al. [20] and Baevski et al. [9], we use 960 hours of speech from the LibriSpeech [15] corpus to pre-train our model. |
| Dataset Splits | Yes | We fine-tune the student model with CTC [35] using labeled speech data under four different setups: 10 minutes / 1 hour / 10 hours from Libri-Light [36] or 100 hours from LibriSpeech [15]. ... We use the 10-hour subset to fine-tune the teacher network after 200k steps of pre-training. WERs are reported by decoding the dev-other subset with a fixed language model weight of 2 and a word insertion penalty of 1, following Baevski et al. [13]. (A minimal CTC fine-tuning sketch follows the table.) |
| Hardware Specification | Yes | Pre-training the model takes about 180 hours on 16 Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions software such as PyTorch and the Adam optimizer, but it does not specify version numbers for these or any other key software components, which would be needed for a reproducible description of the ancillary software. |
| Experiment Setup | Yes | We focus on the BASE-sized transformer [4] with K = 12 layers and embedding dimension D = 768 due to resource constraints, with a batch size of 63 minutes of audio in total across 16 GPUs. ... The student model is trained for 400k steps with the Adam optimizer [31], with a learning rate ramped up linearly to 0.0005 within the first 12k steps, held for the following 188k steps, and exponentially decayed to 0.00005 over the final 200k steps. (The schedule is sketched as code below the table.) |
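The Pseudocode row points to Algorithm 1 in Appendix A.3 of the paper, which is not reproduced here. The block below is only a rough sketch of the core DinoSR recipe described in the paper: the teacher is an exponential moving average of the student, the outputs of the top teacher layers are discretized with per-layer codebooks updated by online clustering, and the student is trained to predict the resulting cluster indices at masked positions. All class, method, and hyperparameter names (`DinoSRSketch`, `ema_update`, the decay rates, the number of clustered layers, the codebook size) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

class DinoSRSketch(torch.nn.Module):
    """Rough sketch of DinoSR-style self-distillation with online clustering.

    Assumes `student` and `teacher` are identical transformer encoders that
    return a list of per-layer hidden states of shape (batch, time, dim).
    The top `num_layers` teacher layers each own a codebook, and the student
    has one linear prediction head per clustered layer.
    """

    def __init__(self, student, teacher, num_layers=8, dim=768,
                 codebook_size=256, teacher_decay=0.999, code_decay=0.9):
        super().__init__()
        self.student, self.teacher = student, teacher
        self.teacher_decay, self.code_decay = teacher_decay, code_decay
        # Codebooks are updated by EMA only, never by gradients.
        self.register_buffer("codebooks",
                             torch.randn(num_layers, codebook_size, dim))
        self.heads = torch.nn.ModuleList(
            [torch.nn.Linear(dim, codebook_size) for _ in range(num_layers)])

    @torch.no_grad()
    def ema_update(self):
        # Teacher weights track the student with an exponential moving average;
        # call this once after every optimizer step.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.teacher_decay).add_(ps, alpha=1 - self.teacher_decay)

    @torch.no_grad()
    def _cluster(self, feats, layer):
        # Online clustering: assign each teacher frame to its nearest codeword,
        # then pull the used codewords toward the mean of their assigned frames.
        codes = self.codebooks[layer]                      # (V, D)
        assign = torch.cdist(feats, codes).argmin(dim=-1)  # (N,)
        for v in assign.unique():
            mean = feats[assign == v].mean(dim=0)
            codes[v].mul_(self.code_decay).add_(mean, alpha=1 - self.code_decay)
        return assign

    def forward(self, wav, mask):
        """`mask` is a boolean (batch, time) tensor marking masked positions."""
        with torch.no_grad():
            teacher_layers = self.teacher(wav)             # teacher sees unmasked input
        student_layers = self.student(wav, mask=mask)      # student sees masked input
        top = student_layers[-1]                           # student's final layer

        loss = 0.0
        for k, head in enumerate(self.heads):
            feats = teacher_layers[-(k + 1)][mask]         # teacher frames at masked spots
            targets = self._cluster(feats, k)              # discrete cluster targets
            logits = head(top[mask])                       # student predicts the cluster index
            loss = loss + F.cross_entropy(logits, targets)
        return loss / len(self.heads)
```

A training step under these assumptions would be `loss = model(wav, mask)`, then `loss.backward()`, `optimizer.step()`, and finally `model.ema_update()`; gradients flow only through the student and its prediction heads, while the teacher and codebooks are updated by moving averages.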
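The Dataset Splits row describes fine-tuning the pre-trained model with CTC on labeled subsets. The helper below is a minimal, hedged sketch of one such fine-tuning step, assuming a pre-trained `encoder` that returns frame-level features and a linear `ctc_head` over a character vocabulary with blank index 0; the function name and argument layout are assumptions, and LM-weighted decoding (language model weight 2, word insertion penalty 1) happens at inference time and is not shown.

```python
import torch
import torch.nn.functional as F

def ctc_finetune_step(encoder, ctc_head, optimizer, wav, frame_lens,
                      targets, target_lens, blank=0):
    """One CTC fine-tuning step on a labeled batch (illustrative only)."""
    feats = encoder(wav)                          # (batch, frames, dim)
    log_probs = ctc_head(feats).log_softmax(-1)   # (batch, frames, vocab)
    # torch's CTC loss expects (frames, batch, vocab) log-probabilities plus
    # per-utterance frame and label lengths.
    loss = F.ctc_loss(log_probs.transpose(0, 1), targets,
                      input_lengths=frame_lens, target_lengths=target_lens,
                      blank=blank, zero_infinity=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```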
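The Experiment Setup row quotes a three-phase learning-rate schedule: linear warm-up to 5e-4 over the first 12k steps, a 188k-step hold, then exponential decay to 5e-5 over the final 200k steps. The function below sketches that schedule as a per-step learning-rate function; the function name and the exact per-step decay formulation are assumptions rather than the authors' implementation.

```python
def dinosr_lr(step: int,
              peak_lr: float = 5e-4,
              final_lr: float = 5e-5,
              warmup_steps: int = 12_000,
              hold_steps: int = 188_000,
              decay_steps: int = 200_000) -> float:
    """Three-phase schedule: linear warm-up, constant hold, exponential decay."""
    if step < warmup_steps:
        # Phase 1: ramp linearly from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Phase 2: hold the peak learning rate through step 200k.
        return peak_lr
    # Phase 3: decay exponentially toward the final learning rate by step 400k.
    progress = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    return peak_lr * (final_lr / peak_lr) ** progress
```

For example, `dinosr_lr(0)` returns 0.0, any step from 12k through 200k returns 5e-4, and step 400k returns 5e-5; divided by `peak_lr`, the value can serve as the multiplier for `torch.optim.lr_scheduler.LambdaLR`.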