Unsupervised Learning of Spoken Language with Visual Context

Authors: David Harwath, Antonio Torralba, James Glass

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For the experiments in this paper, we split a subset of our captions into a 114,000 utterance training set, a 2,400 utterance development set, and a 2,400 utterance testing set, covering a 27,891 word vocabulary (as specified by the Google ASR). The average caption duration was 9.5 seconds, and contained an average of 21.9 words. All the sets were randomly sampled, so many of the same speakers will appear in all three sets.
Researcher Affiliation | Academia | David Harwath, Antonio Torralba, and James R. Glass, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02115, {dharwath, torralba, jrg}@csail.mit.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper states: 'We plan to make our dataset publicly available in the near future.' This refers to the dataset, not the code, and promises future rather than current availability. No statement is made about releasing the source code for the methodology.
Open Datasets | Yes | Since we desire spontaneously spoken audio captions, we collected a new corpus of captions for the Places205 dataset [23]. Places205 contains over 2.5 million images categorized into 205 different scene classes, providing a rich variety of object types in many different contexts.
Dataset Splits | Yes | For the experiments in this paper, we split a subset of our captions into a 114,000 utterance training set, a 2,400 utterance development set, and a 2,400 utterance testing set, covering a 27,891 word vocabulary (as specified by the Google ASR).
Hardware Specification | Yes | All models were trained on an NVIDIA Titan X GPU, which usually took about 2 days.
Software Dependencies | No | The paper mentions software components such as the VGG 16-layer network [29], the Spoke JavaScript framework [27], the Google Speech Recognition service, and Kaldi [28], but it does not provide version numbers for these dependencies, which are needed for reproducibility.
Experiment Setup | Yes | Each minibatch consists of B ground truth pairs... In practice, we set our minibatch size to 128, used a constant momentum of 0.9, and ran SGD training for 50 epochs. Learning rates took a bit of tuning to get right. In the end, we settled on an initial value of 1e-5, and employed a schedule which decreased the learning rate by a factor between 2 and 5 every 5 to 10 epochs. (A minimal sketch of this schedule appears below the table.)
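The training configuration quoted in the Experiment Setup row amounts to a small set of optimizer settings. The snippet below is a minimal sketch of that schedule only, assuming PyTorch (the paper does not name its framework); the model, data, and loss are hypothetical placeholders, and the concrete decay factor and step size are illustrative picks from the quoted ranges, not values stated in the paper.

```python
import torch
import torch.nn as nn

# Stand-in embedding network; the paper pairs a VGG image branch with an audio
# CNN, which is not reproduced here.
model = nn.Linear(1024, 1024)

# Quoted hyperparameters: minibatch size 128, constant momentum 0.9, 50 epochs
# of SGD, initial learning rate 1e-5, decayed by a factor between 2 and 5 every
# 5 to 10 epochs. The concrete decay (factor 3 every 7 epochs) is an
# illustrative pick from those ranges.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=1.0 / 3)

for epoch in range(50):
    for _ in range(114_000 // 128):        # one pass over a 114k-utterance training set
        batch = torch.randn(128, 1024)     # stand-in for 128 caption/image pairs
        loss = model(batch).pow(2).mean()  # dummy objective; the paper trains a ranking loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # apply the step decay once per epoch
```

The objective over the B ground-truth caption/image pairs mentioned in the truncated quote is not reproduced here; a real reimplementation would substitute the paper's loss and its audio/image embedding networks for the placeholders above.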