Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
Authors: Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille
ICLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Junhua Mao University of California, Los Angeles; Baidu Research mjhustc@ucla.edu Wei Xu & Yi Yang & Jiang Wang & Zhiheng Huang Baidu Research {wei.xu,yangyi05,wangjiang03,huangzhiheng}@baidu.com Alan Yuille University of California, Los Angeles yuille@stat.ucla.edu |
| Pseudocode | No | The paper describes the model architecture with equations and diagrams (Figure 2), but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and related data (e.g. refined image features and hypotheses sentences generated by the m-RNN model) are available at https://github.com/mjhucla/mRNN-CR. |
| Open Datasets | Yes | We test our method on four benchmark datasets with sentence level annotations: IAPR TC-12 (Grubinger et al. (2006)), Flickr 8K (Rashtchian et al. (2010)), Flickr 30K (Young et al. (2014)) and MS COCO (Lin et al. (2014)). The dataset partition of MS COCO and Flickr30K is available in the project page: www.stat.ucla.edu/~junhua.mao/m-RNN.html. |
| Dataset Splits | Yes | [IAPR TC-12] We adopt the standard separation of training and testing set as previous works (Guillaumin et al. (2010); Kiros et al. (2014b)) with 17,665 images for training and 1,962 images for testing. [Flickr 8K] There are 6,000 images for training, 1,000 images for validation and 1,000 images for testing. [MS COCO] We randomly sampled 4,000 images for validation and 1,000 images for testing from their currently released validation set. |
| Hardware Specification | No | The m-RNN model is trained using Baidu's internal deep learning platform PADDLE, which allows us to explore many different model architectures in a short period. It takes 25 ms on average to generate a sentence (excluding the image feature extraction stage) on a single-core CPU. |
| Software Dependencies | No | The m-RNN model is trained using Baidu's internal deep learning platform PADDLE, which allows us to explore many different model architectures in a short period. |
| Experiment Setup | Yes | The hyperparameters, such as layer dimensions and the choice of the non-linear activation functions, are tuned via cross-validation on Flickr8K dataset and are then fixed across all the experiments. After the recurrent layer, we set up a 512 dimensional multimodal layer that connects the language model part and the vision part of the m-RNN model (see Figure 2(b)). |
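The 512-dimensional multimodal layer quoted above fuses the word embedding, the recurrent-layer activation, and the image feature, as described in the paper: m(t) = g2(Vw·w(t) + Vr·r(t) + VI·I), with g2 a scaled hyperbolic tangent. The sketch below illustrates this fusion in NumPy; only the 512-d multimodal dimension comes from the report, while the word, recurrent, and image feature dimensions and the random weights are illustrative assumptions.

```python
import numpy as np

# Illustrative dimensions; only DIM_MULTI = 512 is stated in the report.
DIM_WORD, DIM_RECUR, DIM_IMAGE, DIM_MULTI = 256, 256, 4096, 512

rng = np.random.default_rng(0)
V_w = rng.standard_normal((DIM_MULTI, DIM_WORD)) * 0.01   # word-embedding projection
V_r = rng.standard_normal((DIM_MULTI, DIM_RECUR)) * 0.01  # recurrent-layer projection
V_I = rng.standard_normal((DIM_MULTI, DIM_IMAGE)) * 0.01  # image-feature projection

def g2(x):
    # Scaled tanh activation used for the multimodal layer in the paper.
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def multimodal_layer(w_t, r_t, img_feat):
    # m(t) = g2(Vw * w(t) + Vr * r(t) + VI * I): the three modalities are
    # linearly projected into a common 512-d space and summed before the
    # non-linearity.
    return g2(V_w @ w_t + V_r @ r_t + V_I @ img_feat)

m_t = multimodal_layer(rng.standard_normal(DIM_WORD),
                       rng.standard_normal(DIM_RECUR),
                       rng.standard_normal(DIM_IMAGE))
```

At inference, this multimodal activation feeds a softmax over the vocabulary to predict the next word; the sketch stops at the fusion step, which is the part the experiment-setup excerpt describes.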