Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
Authors: Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille
ICLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Junhua Mao University of California, Los Angeles; Baidu Research mjhustc@ucla.edu Wei Xu & Yi Yang & Jiang Wang & Zhiheng Huang Baidu Research {wei.xu,yangyi05,wangjiang03,huangzhiheng}@baidu.com Alan Yuille University of California, Los Angeles yuille@stat.ucla.edu |
| Pseudocode | No | The paper describes the model architecture with equations and diagrams (Figure 2), but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and related data (e.g. refined image features and hypotheses sentences generated by the m-RNN model) are available at https://github.com/mjhucla/mRNN-CR. |
| Open Datasets | Yes | We test our method on four benchmark datasets with sentence level annotations: IAPR TC-12 (Grubinger et al. (2006)), Flickr 8K (Rashtchian et al. (2010)), Flickr 30K (Young et al. (2014)) and MS COCO (Lin et al. (2014)). The dataset partition of MS COCO and Flickr30K is available in the project page: www.stat.ucla.edu/~junhua.mao/m-RNN.html. |
| Dataset Splits | Yes | [IAPR TC-12] We adopt the standard separation of training and testing set as previous works (Guillaumin et al. (2010); Kiros et al. (2014b)) with 17,665 images for training and 1,962 images for testing. [Flickr 8K] There are 6,000 images for training, 1,000 images for validation and 1,000 images for testing. [MS COCO] We randomly sampled 4,000 images for validation and 1,000 images for testing from their currently released validation set. |
| Hardware Specification | No | The m-RNN model is trained using Baidu's internal deep learning platform PADDLE, which allows us to explore many different model architectures in a short period. It takes 25 ms on average to generate a sentence (excluding the image feature extraction stage) on a single-core CPU. |
| Software Dependencies | No | The m-RNN model is trained using Baidu's internal deep learning platform PADDLE, which allows us to explore many different model architectures in a short period. |
| Experiment Setup | Yes | The hyperparameters, such as layer dimensions and the choice of the non-linear activation functions, are tuned via cross-validation on Flickr8K dataset and are then fixed across all the experiments. After the recurrent layer, we set up a 512 dimensional multimodal layer that connects the language model part and the vision part of the m-RNN model (see Figure 2(b)). |
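The 512-dimensional multimodal layer quoted above fuses the word embedding, the recurrent-layer activation, and the image feature, as described in the paper: m(t) = g2(Vw·w(t) + Vr·r(t) + VI·I), with g2 a scaled hyperbolic tangent. The sketch below illustrates this fusion in NumPy; only the 512-d multimodal dimension comes from the report, while the word, recurrent, and image feature dimensions and the random weights are illustrative assumptions.

```python
import numpy as np

# Illustrative dimensions; only DIM_MULTI = 512 is stated in the report.
DIM_WORD, DIM_RECUR, DIM_IMAGE, DIM_MULTI = 256, 256, 4096, 512

rng = np.random.default_rng(0)
V_w = rng.standard_normal((DIM_MULTI, DIM_WORD)) * 0.01   # word-embedding projection
V_r = rng.standard_normal((DIM_MULTI, DIM_RECUR)) * 0.01  # recurrent-layer projection
V_I = rng.standard_normal((DIM_MULTI, DIM_IMAGE)) * 0.01  # image-feature projection

def g2(x):
    # Scaled tanh activation used for the multimodal layer in the paper.
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def multimodal_layer(w_t, r_t, img_feat):
    # m(t) = g2(Vw * w(t) + Vr * r(t) + VI * I): the three modalities are
    # linearly projected into a common 512-d space and summed before the
    # non-linearity.
    return g2(V_w @ w_t + V_r @ r_t + V_I @ img_feat)

m_t = multimodal_layer(rng.standard_normal(DIM_WORD),
                       rng.standard_normal(DIM_RECUR),
                       rng.standard_normal(DIM_IMAGE))
```

At inference, this multimodal activation feeds a softmax over the vocabulary to predict the next word; the sketch stops at the fusion step, which is the part the experiment-setup excerpt describes.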