Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

Authors: David Harwath*, Wei-Ning Hsu*, James Glass

ICLR 2020

Reproducibility assessment. Each entry below gives the variable, the extracted result, and the supporting LLM response.
Research Type: Experimental. In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. ... We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer.
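As context for the quantization mechanism the abstract refers to, the sketch below shows a generic EMA-updated vector-quantization layer of the kind that can be inserted between encoder layers. It is an illustrative PyTorch sketch, not the authors' ResDAVEnet-VQ code; the class name, feature dimension, and commitment cost are assumptions, while the codebook size (1024) and EMA decay (0.99) follow the setup quoted under Experiment Setup below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Generic EMA-updated VQ layer (van den Oord et al., 2017 style); illustrative only."""

    def __init__(self, num_codes=1024, dim=256, decay=0.99, commitment_cost=0.25):
        super().__init__()
        self.decay = decay
        self.commitment_cost = commitment_cost
        codebook = torch.randn(num_codes, dim) * 0.01
        self.register_buffer("codebook", codebook)             # discrete unit inventory
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("embed_avg", codebook.clone())

    def forward(self, z):                                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))
        # Squared L2 distance from every frame to every codebook vector.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.t()
                 + self.codebook.pow(2).sum(1))
        codes = dists.argmin(dim=1)                            # discrete unit indices
        quantized = self.codebook[codes].view_as(z)

        if self.training:                                      # EMA codebook update
            with torch.no_grad():
                onehot = F.one_hot(codes, self.codebook.size(0)).type_as(flat)
                self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
                self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat, alpha=1 - self.decay)
                # Laplace smoothing of cluster sizes is omitted for brevity.
                self.codebook.copy_(self.embed_avg / (self.cluster_size.unsqueeze(1) + 1e-5))

        # Commitment loss keeps encoder outputs close to their assigned codes.
        loss = self.commitment_cost * F.mse_loss(z, quantized.detach())
        # Straight-through estimator: gradients flow around the argmin.
        quantized = z + (quantized - z).detach()
        return quantized, codes.reshape(z.shape[:-1]), loss
```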
Researcher Affiliation: Academia. David Harwath, Wei-Ning Hsu, and James Glass; Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; {dharwath,wnhsu,glass}@csail.mit.edu
Pseudocode: No. The paper describes the model architecture and training process in text and diagrams (e.g., Figure 1), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code: No. The paper references code for baseline models ('WaveNet-VQ (Chorowski et al., 2019) provided by Cho et al. (2019) and FHVAE-DPGMM (Feng et al., 2019). Using the code accompanied with the WaveNet-VQ submission, we were able to train their model...'), but it does not provide a link to, or a statement about, the availability of the authors' own ResDAVEnet-VQ model code.
Open Datasets: Yes. For training our models, we utilize the MIT Places 205 dataset (Zhou et al., 2014) and their accompanying spoken audio captions (Harwath et al., 2016; 2018b).
Dataset Splits: Yes. For vetting our models during training, we use a held-out validation set of 1,000 image-caption pairs. We trained each model on the Places audio caption train split, and computed the image and caption recall at 10 (R@10) scores on the validation split of the Places audio captions after each training epoch.
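For readers checking the R@10 numbers, the snippet below shows one way a cross-modal recall-at-10 score could be computed from an image-caption similarity matrix. This is a hedged sketch: the function name and the assumption of a square similarity matrix with matched pairs on the diagonal are ours, not details taken from the paper.

```python
import torch


def recall_at_k(similarity, k=10):
    """Recall@k for cross-modal retrieval.

    similarity: (N, N) matrix where entry (i, j) scores image i against caption j,
    with the matched pair on the diagonal. Returns (image-to-caption, caption-to-image) recall.
    """
    n = similarity.size(0)
    targets = torch.arange(n)
    # For each image, rank all captions and check the true caption is in the top k.
    i2c = similarity.topk(k, dim=1).indices                 # (N, k)
    r_i2c = (i2c == targets.unsqueeze(1)).any(dim=1).float().mean().item()
    # For each caption, rank all images likewise.
    c2i = similarity.topk(k, dim=0).indices                 # (k, N)
    r_c2i = (c2i == targets.unsqueeze(0)).any(dim=0).float().mean().item()
    return r_i2c, r_c2i


# e.g., on a 1,000-pair validation set, with hypothetical embedding matrices:
# sim = image_embeddings @ caption_embeddings.t()           # (1000, 1000)
# r_img, r_cap = recall_at_k(sim, k=10)
```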
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., GPU model, CPU, memory) used for running the experiments.
Software Dependencies: No. The paper mentions using the Adam optimizer, but it does not specify any software dependencies (e.g., libraries, frameworks) with version numbers that would be required for reproducibility.
Experiment Setup: Yes. All of our models were trained for 180 epochs using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 80. We used an exponentially decaying learning rate schedule, with an initial value of 2e-4 that decayed by a factor of 0.95 every 3 epochs. Following van den Oord et al. (2017), we use an EMA decay factor of γ = 0.99 for training each VQ codebook. Our core experimental results all use a codebook size of 1024 vectors for all quantizers, but in the supplementary material we include experiments with smaller and larger codebooks. Following Chorowski et al. (2019), the jitter probability hyperparameter for each quantization layer was fixed at 0.12.
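To make the quoted hyperparameters concrete, here is a minimal PyTorch training-loop sketch using those values (180 epochs, Adam, batch size 80, initial learning rate 2e-4 decayed by 0.95 every 3 epochs), together with one possible reading of the time-jitter regularizer with p = 0.12. The model/train_loader interface and the exact jitter formulation are assumptions for illustration, not the authors' released code.

```python
import torch


def train(model, train_loader, epochs=180):
    # `model` is assumed to return a scalar loss for an (image, audio) batch
    # of size 80; that interface is a placeholder, not the authors' API.
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    # Exponential decay: multiply the learning rate by 0.95 every 3 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.95)
    for _ in range(epochs):
        for images, audio in train_loader:
            optimizer.zero_grad()
            loss = model(images, audio)
            loss.backward()
            optimizer.step()
        scheduler.step()


def jitter(quantized, p=0.12):
    # One reading of time-jitter regularization (Chorowski et al., 2019): with
    # total probability p, a frame's quantized vector is replaced during
    # training by a copy of its left or right neighbour.
    b, t, d = quantized.shape
    idx = torch.arange(t, device=quantized.device).repeat(b, 1)   # (batch, time)
    r = torch.rand(b, t, device=quantized.device)
    idx = torch.where(r < p / 2, idx - 1, idx)
    idx = torch.where(r > 1 - p / 2, idx + 1, idx)
    idx = idx.clamp(0, t - 1)
    return torch.gather(quantized, 1, idx.unsqueeze(-1).expand(-1, -1, d))
```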