Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

Authors: David Harwath*, Wei-Ning Hsu*, James Glass

ICLR 2020

Reproducibility assessment. Each entry below gives the variable, the extracted result, and the supporting LLM response.
Research Type: Experimental. In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. ... We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer.
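As context for the quantization mechanism the abstract refers to, the sketch below shows a generic EMA-updated vector-quantization layer of the kind that can be inserted between encoder layers. It is an illustrative PyTorch sketch, not the authors' ResDAVEnet-VQ code; the class name, feature dimension, and commitment cost are assumptions, while the codebook size (1024) and EMA decay (0.99) follow the setup quoted under Experiment Setup below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Generic EMA-updated VQ layer (van den Oord et al., 2017 style); illustrative only."""

    def __init__(self, num_codes=1024, dim=256, decay=0.99, commitment_cost=0.25):
        super().__init__()
        self.decay = decay
        self.commitment_cost = commitment_cost
        codebook = torch.randn(num_codes, dim) * 0.01
        self.register_buffer("codebook", codebook)             # discrete unit inventory
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("embed_avg", codebook.clone())

    def forward(self, z):                                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))
        # Squared L2 distance from every frame to every codebook vector.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.t()
                 + self.codebook.pow(2).sum(1))
        codes = dists.argmin(dim=1)                            # discrete unit indices
        quantized = self.codebook[codes].view_as(z)

        if self.training:                                      # EMA codebook update
            with torch.no_grad():
                onehot = F.one_hot(codes, self.codebook.size(0)).type_as(flat)
                self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
                self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat, alpha=1 - self.decay)
                # Laplace smoothing of cluster sizes is omitted for brevity.
                self.codebook.copy_(self.embed_avg / (self.cluster_size.unsqueeze(1) + 1e-5))

        # Commitment loss keeps encoder outputs close to their assigned codes.
        loss = self.commitment_cost * F.mse_loss(z, quantized.detach())
        # Straight-through estimator: gradients flow around the argmin.
        quantized = z + (quantized - z).detach()
        return quantized, codes.reshape(z.shape[:-1]), loss
```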
Researcher Affiliation: Academia. David Harwath, Wei-Ning Hsu, and James Glass; Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; {dharwath,wnhsu,glass}@csail.mit.edu
Pseudocode: No. The paper describes the model architecture and training process in text and diagrams (e.g., Figure 1), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code: No. The paper references code for baseline models ('WaveNet-VQ (Chorowski et al., 2019) provided by Cho et al. (2019) and FHVAE-DPGMM (Feng et al., 2019). Using the code accompanied with the WaveNet-VQ submission, we were able to train their model...'), but it does not provide a link to, or a statement about, the availability of the authors' own ResDAVEnet-VQ model code.
Open Datasets: Yes. For training our models, we utilize the MIT Places 205 dataset (Zhou et al., 2014) and their accompanying spoken audio captions (Harwath et al., 2016; 2018b).
Dataset Splits: Yes. For vetting our models during training, we use a held-out validation set of 1,000 image-caption pairs. We trained each model on the Places audio caption train split, and computed the image and caption recall at 10 (R@10) scores on the validation split of the Places audio captions after each training epoch.
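For readers checking the R@10 numbers, the snippet below shows one way a cross-modal recall-at-10 score could be computed from an image-caption similarity matrix. This is a hedged sketch: the function name and the assumption of a square similarity matrix with matched pairs on the diagonal are ours, not details taken from the paper.

```python
import torch


def recall_at_k(similarity, k=10):
    """Recall@k for cross-modal retrieval.

    similarity: (N, N) matrix where entry (i, j) scores image i against caption j,
    with the matched pair on the diagonal. Returns (image-to-caption, caption-to-image) recall.
    """
    n = similarity.size(0)
    targets = torch.arange(n)
    # For each image, rank all captions and check the true caption is in the top k.
    i2c = similarity.topk(k, dim=1).indices                 # (N, k)
    r_i2c = (i2c == targets.unsqueeze(1)).any(dim=1).float().mean().item()
    # For each caption, rank all images likewise.
    c2i = similarity.topk(k, dim=0).indices                 # (k, N)
    r_c2i = (c2i == targets.unsqueeze(0)).any(dim=0).float().mean().item()
    return r_i2c, r_c2i


# e.g., on a 1,000-pair validation set, with hypothetical embedding matrices:
# sim = image_embeddings @ caption_embeddings.t()           # (1000, 1000)
# r_img, r_cap = recall_at_k(sim, k=10)
```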
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., GPU model, CPU, memory) used for running the experiments.
Software Dependencies: No. The paper mentions using the Adam optimizer, but it does not specify any software dependencies (e.g., libraries, frameworks) with version numbers that would be required for reproducibility.
Experiment Setup: Yes. All of our models were trained for 180 epochs using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 80. We used an exponentially decaying learning rate schedule, with an initial value of 2e-4 that decayed by a factor of 0.95 every 3 epochs. Following van den Oord et al. (2017), we use an EMA decay factor of γ = 0.99 for training each VQ codebook. Our core experimental results all use a codebook size of 1024 vectors for all quantizers, but in the supplementary material we include experiments with smaller and larger codebooks. Following Chorowski et al. (2019), the jitter probability hyperparameter for each quantization layer was fixed at 0.12.
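To make the quoted hyperparameters concrete, here is a minimal PyTorch training-loop sketch using those values (180 epochs, Adam, batch size 80, initial learning rate 2e-4 decayed by 0.95 every 3 epochs), together with one possible reading of the time-jitter regularizer with p = 0.12. The model/train_loader interface and the exact jitter formulation are assumptions for illustration, not the authors' released code.

```python
import torch


def train(model, train_loader, epochs=180):
    # `model` is assumed to return a scalar loss for an (image, audio) batch
    # of size 80; that interface is a placeholder, not the authors' API.
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    # Exponential decay: multiply the learning rate by 0.95 every 3 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.95)
    for _ in range(epochs):
        for images, audio in train_loader:
            optimizer.zero_grad()
            loss = model(images, audio)
            loss.backward()
            optimizer.step()
        scheduler.step()


def jitter(quantized, p=0.12):
    # One reading of time-jitter regularization (Chorowski et al., 2019): with
    # total probability p, a frame's quantized vector is replaced during
    # training by a copy of its left or right neighbour.
    b, t, d = quantized.shape
    idx = torch.arange(t, device=quantized.device).repeat(b, 1)   # (batch, time)
    r = torch.rand(b, t, device=quantized.device)
    idx = torch.where(r < p / 2, idx - 1, idx)
    idx = torch.where(r > 1 - p / 2, idx + 1, idx)
    idx = idx.clamp(0, t - 1)
    return torch.gather(quantized, 1, idx.unsqueeze(-1).expand(-1, -1, d))
```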