Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multi-scale Hierarchical Residual Network for Dense Captioning
Authors: Yan Tian, Xun Wang, Jiachen Wu, Ruili Wang, Bailin Yang
JAIR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results have shown that our approach outperforms most current methods. In this section, we compare the efficiency and the performance of the proposed approach with others. We conduct experiments on a workstation with an Intel i7-4790 3.6 GHz CPU, 32GB memory, and an NVIDIA GTX Titan X graphics card. We conduct extensive ablation experiments and demonstrate the effects of several important components in our framework. All experiments in this subsection are performed on the Visual Genome V1.0 dataset. |
| Researcher Affiliation | Academia | Yan Tian EMAIL Xun Wang EMAIL Jiachen Wu EMAIL Ruili Wang EMAIL Bailin Yang EMAIL School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou 310014, P.R.China |
| Pseudocode | No | The paper includes figures illustrating model architectures (Figure 1, 2, 3, 4) and mathematical equations, but no explicit pseudocode or algorithm blocks are present. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | Finally, the performance of the approach on the Visual Genome V1.0 dataset and the region labelled MS-COCO (Microsoft Common Objects in Context) dataset are demonstrated. We verified our proposed approach on the Visual Genome dataset(Krishna et al., 2017) and partial Microsoft Common Objects in Context (MS-COCO) dataset (Lin et al., 2014). |
| Dataset Splits | Yes | For the purpose of comparison, our experiments are mainly based on the Visual Genome V1.0 dataset. We use 77,398 images for training and 5,000 images for validation and testing, which is the same as the train/val/test splits in (Johnson et al., 2016). MS-COCO is the largest dataset regarding image captioning, with 82,783 images for training, 40,504 images for validation and 40,775 images for testing. |
| Hardware Specification | Yes | We conduct experiments on a workstation with an Intel i7-4790 3.6 GHz CPU, 32GB memory, and an NVIDIA GTX Titan X graphics card. |
| Software Dependencies | Yes | We build our algorithm upon Torch 7 (Collobert, Kavukcuoglu, & Farabet, 2011) to test the performance and computational efficiency. |
| Experiment Setup | Yes | The mini-batch size is 1, and each input image is first resized to a longer side of 720 pixels. We initialize Conv1 and Blocks 1-4 with weights that are pretrained on ImageNet (Deng et al., 2009) and all other weights from a Gaussian with a standard deviation of 0.01. Stochastic gradient descent is used. We set the momentum to 0.9, and the initial learning rate to 0.001, which is halved every 100k iterations. Weight decay is not employed in training. Fully connected layers (FC1 and FC2) have rectified linear units and are regularized with Dropout. An LSTM with 256 hidden nodes is employed for sequential modeling. We set α = 0.1 and β = 0.05 during experiments. |
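The reported optimization schedule (initial learning rate 0.001, halved every 100k SGD iterations) can be sketched as a small helper. This is an illustrative reconstruction only: the paper's experiments were built on Torch 7, and the function name and signature below are ours, not the authors'.

```python
def learning_rate(iteration, base_lr=0.001, halve_every=100_000):
    """Learning rate at a given SGD iteration under the reported schedule:
    start at 0.001 and halve every 100k iterations. (Momentum 0.9 and the
    absence of weight decay are optimizer settings, handled separately.)"""
    return base_lr * 0.5 ** (iteration // halve_every)

# e.g. learning_rate(0) -> 0.001, learning_rate(250_000) -> 0.00025
```

After 300k iterations the rate has dropped to 0.000125, i.e. one eighth of its initial value.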