DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, Jim Glass

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units. ... Quantitatively, DinoSR surpasses the state-of-the-art in speech recognition with limited resources on LibriSpeech [15] and unsupervised acoustic unit discovery [16].
Researcher Affiliation | Collaboration | MIT CSAIL, Meta AI, alexhliu@mit.edu
Pseudocode | Yes | A.3 Pseudo-code for DinoSR training ... Algorithm 1: PyTorch pseudocode for DinoSR
Open Source Code | Yes | Code available at https://github.com/Alexander-H-Liu/dinosr.
Open Datasets | Yes | Following Hsu et al. [20] and Baevski et al. [9], we use 960 hours of speech from the LibriSpeech [15] corpus to pre-train our model.
Dataset Splits | Yes | We fine-tune the student model with CTC [35] using labeled speech data under four different setups, using 10 minutes / 1 hour / 10 hours from Libri-Light [36] or 100 hours from LibriSpeech [15]. ... We use the 10-hour subset to fine-tune the teacher network after 200k steps of pre-training. WERs are reported by decoding the dev-other subset with a fixed language model weight of 2 and a word insertion penalty of 1, following Baevski et al. [13]. (See the decoding-score sketch after the table.)
Hardware Specification | Yes | Pre-training the model takes about 180 hours on 16 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch and the Adam optimizer, but it does not specify version numbers for these or any other key software components, which would be needed to fully reproduce the software environment.
Experiment Setup | Yes | We focus on the BASE-sized transformer [4] with K = 12 layers and embedding dimension D = 768 due to resource constraints, with a batch size of 63 minutes of audio in total across 16 GPUs. ... The student model is trained for 400k steps with the Adam optimizer [31] with a learning rate ramped up linearly to 0.0005 within the first 12k steps, held for the following 188k steps, and exponentially decayed to 0.00005 for the final 200k steps. (See the learning-rate schedule sketch after the table.)
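
The tri-stage learning-rate schedule quoted in the Experiment Setup row (linear ramp to 0.0005 over the first 12k steps, a 188k-step hold, then exponential decay to 0.00005 over the final 200k steps) can be written as a small PyTorch scheduler. The following is a minimal sketch assuming a standard LambdaLR multiplier; the constants come from the quoted text, but the function and module names are illustrative and are not taken from the released dinosr code.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Sketch of the quoted 400k-step schedule: linear warmup to 5e-4 over 12k steps,
# hold for 188k steps, then exponential decay to 5e-5 over the final 200k steps.
PEAK_LR, FINAL_LR = 5e-4, 5e-5
WARMUP, HOLD, DECAY = 12_000, 188_000, 200_000

def lr_multiplier(step: int) -> float:
    """Multiplier applied to the base (peak) learning rate at a given update step."""
    if step < WARMUP:                        # linear ramp-up phase
        return step / WARMUP
    if step < WARMUP + HOLD:                 # constant at the peak
        return 1.0
    t = min(step - WARMUP - HOLD, DECAY)     # exponential decay phase
    return (FINAL_LR / PEAK_LR) ** (t / DECAY)

model = torch.nn.Linear(768, 768)            # placeholder module (hypothetical)
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
scheduler = LambdaLR(optimizer, lr_lambda=lr_multiplier)
```

With this setup, scheduler.step() would be called once per optimizer update so that the multiplier tracks the global step count, reaching 5e-4 at step 12k and decaying to 5e-5 by step 400k.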
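
Similarly, the decoding hyperparameters quoted in the Dataset Splits row (language model weight 2, word insertion penalty 1) are the weights of a standard shallow-fusion score used when beam-search decoding a CTC model with an external language model. The snippet below only illustrates how such a score is typically combined; the function name and arguments are assumptions, not the evaluation code used in the paper.

```python
def hypothesis_score(log_p_ctc: float, log_p_lm: float, num_words: int,
                     lm_weight: float = 2.0,
                     word_insertion_penalty: float = 1.0) -> float:
    """Combined score of one beam-search hypothesis under shallow fusion.

    Illustrative only: the default lm_weight and word_insertion_penalty mirror
    the values quoted above for decoding LibriSpeech dev-other.
    """
    return log_p_ctc + lm_weight * log_p_lm + word_insertion_penalty * num_words
```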