Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images

Authors: Junhua Mao, Jiajing Xu, Yushi Jing, Alan L. Yuille

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our model benefits from incorporating the visual information into the word embeddings, and a weight sharing strategy is crucial for learning such multimodal embeddings.
Researcher Affiliation | Collaboration | Junhua Mao (1), Jiajing Xu (2), Yushi Jing (2), Alan Yuille (1,3); affiliations: (1) University of California, Los Angeles, (2) Pinterest Inc., (3) Johns Hopkins University
Pseudocode | No | The paper describes the model architecture and components but does not provide any pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'The project page is: http://www.stat.ucla.edu/~junhua.mao/multimodal_embedding.html' and footnote 1 says 'The datasets introduced in this work will be gradually released on the project page.' This mentions dataset release, not an explicit release of the source code for the methodology.
Open Datasets | Yes | More specifically, we introduce a large-scale dataset with 300 million sentences describing over 40 million images crawled and downloaded from publicly available Pins (i.e. an image with sentence descriptions uploaded by users) on Pinterest [2]. ... We denote this dataset as the Pinterest40M dataset. ... To facilitate research in this area, we will gradually release the datasets proposed in this paper on our project page.
Dataset Splits | Yes | We train the models until the loss does not decrease on a small validation set with 10,000 images and their descriptions.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions 'Python's stemmer package' but does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | We use the stochastic gradient descent method with a mini-batch size of 256 sentences and a learning rate of 1.0. The gradient is clipped to 10.0. We train the models until the loss does not decrease on a small validation set with 10,000 images and their descriptions. The models will scan the dataset for roughly five epochs. The bias terms of the gates (i.e. b_r and b_u in Eqn. 1 and 2) in the GRU layer are initialized to 1.0. (A hedged training-loop sketch of these reported settings follows the table.)
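
The sketch below illustrates the quoted training configuration only; it is not the authors' code. The SentenceEncoder class, the hidden size, the loss function, and the data loaders are hypothetical stand-ins. Only the optimizer choice (SGD), learning rate 1.0, mini-batch size of 256 sentences, gradient clipping at 10.0, GRU gate-bias initialization to 1.0, the roughly five-epoch budget, and early stopping on a 10,000-image validation set come from the paper's description.

```python
# Minimal PyTorch-style sketch (not the authors' released code) of the
# reported optimization settings.
import torch
import torch.nn as nn

HIDDEN = 512  # hypothetical hidden size; not quoted from the paper


class SentenceEncoder(nn.Module):
    """Placeholder GRU sentence encoder standing in for the paper's model."""

    def __init__(self, vocab_size, embed_dim=300, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        # Initialize the reset (b_r) and update (b_u) gate biases to 1.0.
        # PyTorch packs GRU biases as [b_r | b_z | b_n], each of length `hidden`,
        # split across bias_ih and bias_hh; how the paper's single bias term maps
        # onto this split is an assumption.
        for name, param in self.gru.named_parameters():
            if name.startswith("bias"):
                param.data[: 2 * hidden].fill_(1.0)

    def forward(self, tokens):
        _, h = self.gru(self.embed(tokens))
        return h.squeeze(0)


def train(model, train_loader, val_loader, loss_fn, max_epochs=5):
    """SGD with the reported hyper-parameters and simple early stopping."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
    best_val = float("inf")
    for epoch in range(max_epochs):  # the paper reports roughly five epochs
        model.train()
        for tokens, targets in train_loader:  # mini-batches of 256 sentences
            optimizer.zero_grad()
            loss = loss_fn(model(tokens), targets)
            loss.backward()
            # "Clipped to 10.0"; norm clipping is assumed here (value clipping
            # would also be consistent with the paper's wording).
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
            optimizer.step()

        # Stop once the loss on the small validation set no longer decreases.
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                loss_fn(model(t), y).item() for t, y in val_loader
            ) / max(len(val_loader), 1)
        if val_loss >= best_val:
            break
        best_val = val_loss
```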