Image Caption with Global-Local Attention
Authors: Linghui Li, Sheng Tang, Lixi Deng, Yongdong Zhang, Qi Tian
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed GLA method can generate more relevant sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset under several popular metrics. We conduct experiments on the MS COCO caption dataset, a popular large-scale dataset. |
| Researcher Affiliation | Academia | 1. Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100039, China; 3. Department of Computer Science, University of Texas at San Antonio, TX 78249-1604. {lilinghui,ts,denglixi,zhyd}@ict.ac.cn, qitian@cs.utsa.edu |
| Pseudocode | No | The paper describes algorithmic steps and equations, particularly for the LSTM and the attention mechanism, but it does not present them in a formal, structured 'Pseudocode' or 'Algorithm' block or figure (a hedged sketch of the described attention step follows the table). |
| Open Source Code | No | The paper states 'We implement our global-local attention model based on the LRCN framework (Donahue et al. 2015), an open-source implementation of RNN,' which refers to the use of a third-party open-source tool, not the release of the authors' own implementation. No link or explicit statement about the availability of their specific code is provided. |
| Open Datasets | Yes | We conduct experiments on the well-known MS COCO caption dataset, a popular large-scale dataset. In order to compare fairly with existing methods, we keep the same splits as previous work (Karpathy and Fei-Fei 2015): 5,000 images for validation and another 5,000 images from the validation set for testing. |
| Dataset Splits | Yes | This dataset contains 82,783 images for training and 40,504 images for validation. Each image is annotated with five English sentences by AMT workers. ... In order to compare fairly with existing methods, we keep the same splits as previous work (Karpathy and Fei-Fei 2015): 5,000 images for validation and another 5,000 images from the validation set for testing (a split-bookkeeping sketch follows the table). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only discusses the models and frameworks utilized. |
| Software Dependencies | No | The paper mentions software components and frameworks such as 'VGG16', 'Faster R-CNN', and the 'LRCN framework', but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The learning rate is initially set to 0.01 and then decreased stepwise. For sentence generation, there are two strategies for sampling a sentence for a given image. ... In particular, we obtain the best run when the beam width k is set to 3. ... The above objective function is optimized over the whole training caption set using stochastic gradient descent with a momentum of 0.9. When we train our two-layer LSTM language model with the global-local attention mechanism, we observe overfitting that does not appear in experiments using only global features. Dropout is an important mechanism for regularizing deep networks to reduce overfitting. ... In addition, we add one linear transform layer to reduce the integrated 4096-dimension feature to 1000 dimensions, consistent with the dimension of the LSTM hidden layer (a training-configuration sketch follows the table). |
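
Since the paper provides no pseudocode, the attention step it describes can be sketched from the quoted details alone: a 4096-dimension VGG16 global feature is integrated with Faster R-CNN local object features into a single 4096-dimension vector, which a linear transform reduces to the 1000-dimension LSTM input. The following is a minimal PyTorch sketch under those assumptions; the scoring network, and the choice to attend over the global feature alongside the local ones, are one plausible reading, not the authors' exact equations.

```python
import torch
import torch.nn as nn


class GlobalLocalAttention(nn.Module):
    """Hedged sketch of a global-local attention step: attend over the
    global image feature together with per-object local features, then
    project the integrated 4096-d feature to the 1000-d LSTM input."""

    def __init__(self, feat_dim=4096, hidden_dim=1000):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)  # attention scorer (assumed form)
        self.proj = nn.Linear(feat_dim, hidden_dim)       # the paper's 4096 -> 1000 transform

    def forward(self, global_feat, local_feats, h_prev):
        # global_feat: (B, 4096) VGG16 feature of the whole image
        # local_feats: (B, N, 4096) Faster R-CNN object features
        # h_prev:      (B, 1000) previous LSTM hidden state
        feats = torch.cat([global_feat.unsqueeze(1), local_feats], dim=1)  # (B, N+1, 4096)
        h = h_prev.unsqueeze(1).expand(-1, feats.size(1), -1)              # (B, N+1, 1000)
        alpha = torch.softmax(
            self.score(torch.cat([feats, h], dim=-1)).squeeze(-1), dim=-1)
        integrated = (alpha.unsqueeze(-1) * feats).sum(dim=1)              # (B, 4096)
        return self.proj(integrated)                                      # (B, 1000)


# Example: a batch of 2 images with 5 detected objects each.
att = GlobalLocalAttention()
out = att(torch.randn(2, 4096), torch.randn(2, 5, 4096), torch.randn(2, 1000))
assert out.shape == (2, 1000)
```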
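
The split quoted under 'Dataset Splits' can be made concrete. The sketch below only illustrates the bookkeeping; Karpathy and Fei-Fei (2015) published fixed image lists, so the seeded random draw and the function name `karpathy_style_split` are stand-ins, not the actual split.

```python
import random


def karpathy_style_split(val_image_ids, seed=0):
    """Hold out 5,000 validation images for validation and another 5,000
    for testing, as in the splits the paper reuses; any remaining
    validation images can be folded back into training."""
    ids = sorted(val_image_ids)        # deterministic order before shuffling
    random.Random(seed).shuffle(ids)   # stand-in for the fixed published lists
    val, test = ids[:5000], ids[5000:10000]
    extra_train = ids[10000:]
    return val, test, extra_train


# MS COCO provides 40,504 validation images (and 82,783 training images).
val, test, extra_train = karpathy_style_split(range(40504))
assert len(val) == len(test) == 5000
```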
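
The quoted training details in 'Experiment Setup' map directly onto a standard optimizer configuration. The sketch below uses PyTorch; the dropout rate and the step-decay schedule (step size and factor) are assumptions, since the paper states only the initial rate of 0.01, stepwise decay, momentum 0.9, the use of dropout, and the 4096-to-1000 transform.

```python
import torch.nn as nn
import torch.optim as optim

proj = nn.Linear(4096, 1000)  # the paper's transform: integrated 4096-d feature -> 1000-d input
drop = nn.Dropout(p=0.5)      # dropout against the reported overfitting; the rate is assumed
lstm = nn.LSTM(1000, 1000, num_layers=2, batch_first=True)  # two-layer LSTM, 1000-d hidden state

params = list(proj.parameters()) + list(lstm.parameters())
optimizer = optim.SGD(params, lr=0.01, momentum=0.9)  # values quoted from the paper
# "Decreased by step": a StepLR schedule is one plausible reading; step_size and gamma are assumed.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
```

At inference time, the paper reports its best results with beam search at beam width k = 3.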