Multimodal Neural Language Models
Authors: Ryan Kiros, Ruslan Salakhutdinov, Rich Zemel
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are performed on three datasets with image-text descriptions: IAPR TC-12, Attributes Discovery, and SBU. We further illustrate capabilities of our models through quantitative retrieval evaluation and visualizations of our results. |
| Researcher Affiliation | Academia | Ryan Kiros (RKIROS@CS.TORONTO.EDU), Ruslan Salakhutdinov (RSALAKHU@CS.TORONTO.EDU), Richard Zemel (ZEMEL@CS.TORONTO.EDU); Department of Computer Science, University of Toronto; Canadian Institute for Advanced Research |
| Pseudocode | No | The paper describes algorithms and models in text and diagrams (Figure 2) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states that the Attributes Discovery dataset split and SBU word embeddings 'will be made publicly available', but does not explicitly state that the source code for their methodology is available or provide a link. |
| Open Datasets | Yes | We perform experimental evaluation of our proposed models on three publicly available datasets. IAPR TC-12: This data set consists of 20,000 images... We used a publicly available train/test split for our experiments. Attribute Discovery: This dataset contains roughly 40,000 images... We used a random train/test split for our experiments which will be made publicly available. SBU Captioned Photos: We obtained a subset of roughly 400,000 images from the SBU dataset (Ordonez et al., 2011)... |
| Dataset Splits | Yes | For each of our experiments, we split the training set into 80% training and 20% validation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments, only mentioning general computational aspects like 'gradients from the loss could then be backpropagated from the language model through the convolutional network to update filter weights'. |
| Software Dependencies | No | The paper mentions using pre-trained embeddings of Turian et al. (2010), but does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | Each of our language models were trained using the following hyperparameters: all context matrices used a weight decay of 1.0 × 10⁻⁴ while word representations used a weight decay of 1.0 × 10⁻⁵. All other weight matrices, including the convolutional network filters, use a weight decay of 1.0 × 10⁻⁴. We used batch sizes of 20 and an initial learning rate of 0.2 (averaged over the minibatch) which was exponentially decreased at each epoch by a factor of 0.998. Gated methods used an initial learning rate of 0.02. Initial momentum was set to 0.5 and was increased linearly to 0.9 over 20 epochs. The word representation matrices were initialized to the 50 dimensional pre-trained embeddings of Turian et al. (2010). We used a context size of 5 for each of our models. ... Since features used have varying dimensionality, an additional layer was added to map images to 256 dimensions, so that across all experiments the input size to the bias and gating units are equivalent. |
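
The hyperparameters quoted in the Experiment Setup row can be restated as a short configuration sketch. This is only an illustration, assuming a standard SGD-with-momentum training loop; the names (`HYPERPARAMS`, `learning_rate`, `momentum`) are ours, since no implementation was released with the paper.

```python
# Hypothetical restatement of the training schedule reported in the paper.
# Values come from the quoted Experiment Setup text; the structure is ours.

HYPERPARAMS = {
    "weight_decay_context": 1.0e-4,   # context matrices
    "weight_decay_words": 1.0e-5,     # word representation matrix
    "weight_decay_other": 1.0e-4,     # all other weights, incl. conv filters
    "batch_size": 20,
    "lr_init": 0.2,                   # gated models reportedly used 0.02
    "lr_decay_per_epoch": 0.998,      # exponential decay each epoch
    "momentum_init": 0.5,
    "momentum_final": 0.9,
    "momentum_ramp_epochs": 20,       # linear increase over 20 epochs
    "word_embedding_dim": 50,         # initialized from Turian et al. (2010)
    "context_size": 5,
    "image_feature_dim": 256,         # extra layer maps image features to 256-d
}

def learning_rate(epoch: int, gated: bool = False) -> float:
    """Learning rate at a given epoch, decayed exponentially per epoch."""
    lr0 = 0.02 if gated else HYPERPARAMS["lr_init"]
    return lr0 * HYPERPARAMS["lr_decay_per_epoch"] ** epoch

def momentum(epoch: int) -> float:
    """Momentum ramped linearly from 0.5 to 0.9 over the first 20 epochs."""
    ramp = min(epoch / HYPERPARAMS["momentum_ramp_epochs"], 1.0)
    return HYPERPARAMS["momentum_init"] + ramp * (
        HYPERPARAMS["momentum_final"] - HYPERPARAMS["momentum_init"]
    )

if __name__ == "__main__":
    for e in (0, 10, 20, 50):
        print(f"epoch {e}: lr={learning_rate(e):.4f}, momentum={momentum(e):.2f}")
```

Reproducing these settings would still require choices the paper does not specify (optimizer implementation, gradient clipping, stopping criterion), so the sketch covers only what is explicitly reported.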