Unsupervised Learning of Spoken Language with Visual Context

Authors: David Harwath, Antonio Torralba, James Glass

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For the experiments in this paper, we split a subset of our captions into a 114,000 utterance training set, a 2,400 utterance development set, and a 2,400 utterance testing set, covering a 27,891 word vocabulary (as specified by the Google ASR). The average caption duration was 9.5 seconds, and contained an average of 21.9 words. All the sets were randomly sampled, so many of the same speakers will appear in all three sets.
Researcher Affiliation | Academia | David Harwath, Antonio Torralba, and James R. Glass, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02115, {dharwath, torralba, jrg}@csail.mit.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper states: 'We plan to make our dataset publicly available in the near future.' This refers to the dataset, not the code, and promises future rather than current availability. No statement is made about releasing the source code for the methodology.
Open Datasets | Yes | Since we desire spontaneously spoken audio captions, we collected a new corpus of captions for the Places205 dataset [23]. Places205 contains over 2.5 million images categorized into 205 different scene classes, providing a rich variety of object types in many different contexts.
Dataset Splits | Yes | For the experiments in this paper, we split a subset of our captions into a 114,000 utterance training set, a 2,400 utterance development set, and a 2,400 utterance testing set, covering a 27,891 word vocabulary (as specified by the Google ASR).
Hardware Specification | Yes | All models were trained on an NVIDIA Titan X GPU, which usually took about 2 days.
Software Dependencies | No | The paper mentions software components such as the VGG 16-layer network [29], the Spoke JavaScript framework [27], the Google Speech Recognition service, and Kaldi [28], but it does not provide version numbers for these dependencies, which are needed for reproducibility.
Experiment Setup | Yes | Each minibatch consists of B ground truth pairs... In practice, we set our minibatch size to 128, used a constant momentum of 0.9, and ran SGD training for 50 epochs. Learning rates took a bit of tuning to get right. In the end, we settled on an initial value of 1e-5, and employed a schedule which decreased the learning rate by a factor between 2 and 5 every 5 to 10 epochs. (A minimal sketch of this schedule appears below the table.)
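The training configuration quoted in the Experiment Setup row amounts to a small set of optimizer settings. The snippet below is a minimal sketch of that schedule only, assuming PyTorch (the paper does not name its framework); the model, data, and loss are hypothetical placeholders, and the concrete decay factor and step size are illustrative picks from the quoted ranges, not values stated in the paper.

```python
import torch
import torch.nn as nn

# Stand-in embedding network; the paper pairs a VGG image branch with an audio
# CNN, which is not reproduced here.
model = nn.Linear(1024, 1024)

# Quoted hyperparameters: minibatch size 128, constant momentum 0.9, 50 epochs
# of SGD, initial learning rate 1e-5, decayed by a factor between 2 and 5 every
# 5 to 10 epochs. The concrete decay (factor 3 every 7 epochs) is an
# illustrative pick from those ranges.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=1.0 / 3)

for epoch in range(50):
    for _ in range(114_000 // 128):        # one pass over a 114k-utterance training set
        batch = torch.randn(128, 1024)     # stand-in for 128 caption/image pairs
        loss = model(batch).pow(2).mean()  # dummy objective; the paper trains a ranking loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # apply the step decay once per epoch
```

The objective over the B ground-truth caption/image pairs mentioned in the truncated quote is not reproduced here; a real reimplementation would substitute the paper's loss and its audio/image embedding networks for the placeholders above.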