Multimodal Neural Language Models

Authors: Ryan Kiros, Ruslan Salakhutdinov, Richard Zemel

Venue: ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentation is performed on three datasets with image-text descriptions: IAPR TC-12, Attributes Discovery, and the SBU dataset. We further illustrate capabilities of our models through quantitative retrieval evaluation and visualizations of our results.
Researcher Affiliation | Academia | Ryan Kiros (RKIROS@CS.TORONTO.EDU), Ruslan Salakhutdinov (RSALAKHU@CS.TORONTO.EDU), Richard Zemel (ZEMEL@CS.TORONTO.EDU), Department of Computer Science, University of Toronto; Canadian Institute for Advanced Research.
Pseudocode | No | The paper describes algorithms and models in text and diagrams (Figure 2) but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states that the Attributes Discovery dataset split and SBU word embeddings 'will be made publicly available', but does not explicitly state that the source code for their methodology is available or provide a link.
Open Datasets | Yes | We perform experimental evaluation of our proposed models on three publicly available datasets: IAPR TC-12: This data set consists of 20,000 images... We used a publicly available train/test split for our experiments. Attribute Discovery: This dataset contains roughly 40,000 images... We used a random train/test split for our experiments which will be made publicly available. SBU Captioned Photos: We obtained a subset of roughly 400,000 images from the SBU dataset (Ordonez et al., 2011)...
Dataset Splits | Yes | For each of our experiments, we split the training set into 80% training and 20% validation. (A minimal split sketch appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments, only mentioning general computational aspects like 'gradients from the loss could then be backpropagated from the language model through the convolutional network to update filter weights'.
Software Dependencies | No | The paper mentions using pre-trained embeddings of Turian et al. (2010), but does not specify software dependencies with version numbers.
Experiment Setup | Yes | Each of our language models were trained using the following hyperparameters: all context matrices used a weight decay of 1.0 × 10^-4 while word representations used a weight decay of 1.0 × 10^-5. All other weight matrices, including the convolutional network filters, use a weight decay of 1.0 × 10^-4. We used batch sizes of 20 and an initial learning rate of 0.2 (averaged over the minibatch) which was exponentially decreased at each epoch by a factor of 0.998. Gated methods used an initial learning rate of 0.02. Initial momentum was set to 0.5 and was increased linearly to 0.9 over 20 epochs. The word representation matrices were initialized to the 50 dimensional pre-trained embeddings of Turian et al. (2010). We used a context size of 5 for each of our models. ... Since features used have varying dimensionality, an additional layer was added to map images to 256 dimensions, so that across all experiments the input size to the bias and gating units are equivalent. (These settings are collected into a code sketch after the table.)
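
The 80% training / 20% validation split quoted under Dataset Splits can be illustrated with a minimal Python sketch. The shuffling procedure, function name, and random seed below are assumptions for illustration, not details taken from the paper.

    import numpy as np

    def train_val_split(num_train_examples, val_frac=0.2, seed=0):
        # Shuffle the training indices and hold out 20% for validation.
        rng = np.random.RandomState(seed)
        idx = rng.permutation(num_train_examples)
        n_val = int(val_frac * num_train_examples)
        return idx[n_val:], idx[:n_val]  # train indices, validation indices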
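
The Experiment Setup row lists concrete hyperparameters; the Python sketch below collects them into a single configuration together with the learning-rate decay and momentum ramp described in the quote. The variable names and schedule functions are illustrative assumptions, not the authors' code.

    # Hyperparameters as quoted from the paper; names are illustrative.
    config = {
        "weight_decay_context": 1.0e-4,   # context matrices
        "weight_decay_words": 1.0e-5,     # word representation matrix
        "weight_decay_other": 1.0e-4,     # all other weights, incl. conv filters
        "batch_size": 20,
        "lr_init": 0.2,                   # gated methods used 0.02
        "lr_decay_per_epoch": 0.998,      # exponential decay factor
        "momentum_init": 0.5,
        "momentum_final": 0.9,
        "momentum_ramp_epochs": 20,
        "word_dim": 50,                   # Turian et al. (2010) embeddings
        "context_size": 5,
        "image_proj_dim": 256,            # extra layer mapping image features to 256-d
    }

    def learning_rate(epoch, cfg=config):
        # Initial rate, exponentially decreased by a factor of 0.998 each epoch.
        return cfg["lr_init"] * cfg["lr_decay_per_epoch"] ** epoch

    def momentum(epoch, cfg=config):
        # Momentum increased linearly from 0.5 to 0.9 over the first 20 epochs.
        frac = min(epoch / cfg["momentum_ramp_epochs"], 1.0)
        return cfg["momentum_init"] + frac * (cfg["momentum_final"] - cfg["momentum_init"])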