Unsupervised Learning of Spoken Language with Visual Context
Authors: David Harwath, Antonio Torralba, James Glass
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For the experiments in this paper, we split a subset of our captions into a 114,000 utterance training set, a 2,400 utterance development set, and a 2,400 utterance testing set, covering a 27,891 word vocabulary (as specified by the Google ASR). The average caption was 9.5 seconds long and contained an average of 21.9 words. All the sets were randomly sampled, so many of the same speakers will appear in all three sets. |
| Researcher Affiliation | Academia | David Harwath, Antonio Torralba, and James R. Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, MA 02139 {dharwath, torralba, jrg}@csail.mit.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states: 'We plan to make our dataset publicly available in the near future.' This refers to the dataset, not the code, and promises future rather than current availability. The paper makes no statement about releasing the source code for its methodology. |
| Open Datasets | Yes | Since we desire spontaneously spoken audio captions, we collected a new corpus of captions for the Places205 dataset [23]. Places205 contains over 2.5 million images categorized into 205 different scene classes, providing a rich variety of object types in many different contexts. |
| Dataset Splits | Yes | For the experiments in this paper, we split a subset of our captions into a 114,000 utterance training set, a 2,400 utterance development set, and a 2,400 utterance testing set, covering a 27,891 word vocabulary (as specified by the Google ASR). (See the split sketch after the table.) |
| Hardware Specification | Yes | All models were trained on an NVIDIA Titan X GPU, which usually took about 2 days. |
| Software Dependencies | No | The paper mentions software components such as the VGG 16-layer network [29], the Spoke JavaScript framework [27], the Google Speech Recognition service, and Kaldi [28], but it does not provide version numbers for these dependencies, which reproducibility would require. |
| Experiment Setup | Yes | Each minibatch consists of B ground truth pairs... In practice, we set our minibatch size to 128, used a constant momentum of 0.9, and ran SGD training for 50 epochs. Learning rates took a bit of tuning to get right. In the end, we settled on an initial value of 1e-5, and employed a schedule which decreased the learning rate by a factor between 2 and 5 every 5 to 10 epochs. (See the training sketch after the table.) |
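The randomly sampled split quoted in the Research Type and Dataset Splits rows maps onto a few lines of code. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `caption_pairs` is a hypothetical list of (image, spoken-caption) records and the seed is arbitrary; only the set sizes and the purely random, speaker-overlapping sampling come from the paper.

```python
import random

# Set sizes reported in the paper; the combined sets cover a
# 27,891-word vocabulary per the Google ASR transcripts.
TRAIN_SIZE, DEV_SIZE, TEST_SIZE = 114_000, 2_400, 2_400

def random_split(caption_pairs, seed=0):
    """Randomly split utterances; speakers may appear in all three sets."""
    pairs = list(caption_pairs)         # hypothetical (image, caption) records
    random.Random(seed).shuffle(pairs)  # purely random sampling, no speaker grouping
    train = pairs[:TRAIN_SIZE]
    dev = pairs[TRAIN_SIZE:TRAIN_SIZE + DEV_SIZE]
    test = pairs[TRAIN_SIZE + DEV_SIZE:TRAIN_SIZE + DEV_SIZE + TEST_SIZE]
    return train, dev, test
```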
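Likewise, the hyperparameters quoted in the Experiment Setup row correspond to a standard SGD loop with a step decay. The sketch below is an assumption-laden illustration in PyTorch (which postdates the paper), with a placeholder model, data, and loss standing in for the real image/audio embedding networks; decaying by a factor of 3 every 7 epochs is one illustrative pick from the ranges the paper reports (a factor between 2 and 5, every 5 to 10 epochs).

```python
import torch
from torch import nn

BATCH_SIZE = 128      # "we set our minibatch size to 128"
EPOCHS = 50           # "ran SGD training for 50 epochs"
INITIAL_LR = 1e-5     # "an initial value of 1e-5"
MOMENTUM = 0.9        # "a constant momentum of 0.9"
DECAY_FACTOR = 3      # illustrative pick from the reported range 2-5
DECAY_EVERY = 7       # illustrative pick from the reported range 5-10 epochs

model = nn.Linear(128, 1)              # placeholder for the embedding networks
inputs = torch.randn(BATCH_SIZE, 128)  # placeholder minibatch of ground truth pairs
targets = torch.randn(BATCH_SIZE, 1)
loss_fn = nn.MSELoss()                 # placeholder for the paper's objective

optimizer = torch.optim.SGD(model.parameters(), lr=INITIAL_LR, momentum=MOMENTUM)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=DECAY_EVERY, gamma=1.0 / DECAY_FACTOR)

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()  # lr /= DECAY_FACTOR every DECAY_EVERY epochs
```

With these illustrative picks the learning rate is cut seven times over 50 epochs, ending near 1e-5 / 3^7 ≈ 4.6e-9; the paper pins down only the ranges, not the exact schedule.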