Convolutional Auto-encoding of Sentence Topics for Image Paragraph Generation
Authors: Jing Wang, Yingwei Pan, Ting Yao, Jinhui Tang, Tao Mei
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on Stanford image paragraph dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, CAE-LSTM increases CIDEr performance from 20.93% to 25.15%. |
| Researcher Affiliation | Collaboration | Jing Wang¹, Yingwei Pan², Ting Yao², Jinhui Tang¹ and Tao Mei² — ¹School of Computer Science and Engineering, Nanjing University of Science and Technology, China; ²JD AI Research, Beijing, China |
| Pseudocode | No | The paper describes the architecture and processes of CAE and CAE-LSTM but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We conducted the experiments and evaluated our CAE-LSTM on Stanford image paragraph dataset (Stanford) [Krause et al., 2017], a benchmark in the field of image paragraph generation. |
| Dataset Splits | Yes | In our experiments, we follow the widely used settings in [Krause et al., 2017] and take 14,575 images for training, 2,487 for validation and 2,489 for testing. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as "Faster R-CNN", "VGG16", and "LSTM", and uses the Microsoft COCO Evaluation Server for metric computation. However, it does not provide version numbers for any key software dependencies (e.g., "Python 3.8", "PyTorch 1.9"). |
| Experiment Setup | Yes | Settings. For each image, we apply Faster R-CNN to detect objects within this image and select top M = 50 regions with highest detection confidences to represent the image... The maximum sentence number K is 6 and the maximum word number in a sentence is 20 (padded where necessary). For our CAE, the convolutional filter size in the convolutional layer is set as C1 = 26 with stride C2 = 2. The dimensions of the embedded region-level feature and distilled topic vector are set as D1 = 1,024 and D2 = 500. For the two-level LSTM networks, the dimension of hidden state in each LSTM is H = 1,000. The dimension of the hidden layer for measuring attention distribution is D3 = 512... Implementation Details. ...we set the learning rate as 1 × 10⁻⁴... For the second phase of self-critical training, the learning rate is set as 5 × 10⁻⁶... The tradeoff parameter β is set as 8 according to the validation performance. Note that Batch normalization [Ioffe and Szegedy, 2015] and dropout [Srivastava et al., 2014] (dropout rate: 0.5) are applied in our experiments. |
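For quick reference, the hyperparameters and dataset splits quoted above can be collected into a small configuration sketch. All variable names here are illustrative choices, not identifiers from the paper; only the values are taken from the quoted settings:

```python
# Illustrative configuration for CAE-LSTM on the Stanford image paragraph
# dataset. Key names are hypothetical; values come from the quoted settings.
CAE_LSTM_CONFIG = {
    "regions_per_image": 50,       # top-M Faster R-CNN regions (M = 50)
    "max_sentences": 6,            # maximum sentence number K
    "max_words_per_sentence": 20,  # padded where necessary
    "conv_filter_size": 26,        # C1, CAE convolutional layer
    "conv_stride": 2,              # C2
    "region_feature_dim": 1024,    # D1, embedded region-level feature
    "topic_vector_dim": 500,       # D2, distilled topic vector
    "lstm_hidden_dim": 1000,       # H, per LSTM in the two-level network
    "attention_hidden_dim": 512,   # D3, attention-distribution hidden layer
    "learning_rate": 1e-4,         # first training phase
    "sc_learning_rate": 5e-6,      # second phase (self-critical training)
    "beta": 8,                     # tradeoff parameter, set on validation
    "dropout": 0.5,
}

# Splits reported from [Krause et al., 2017].
SPLITS = {"train": 14575, "val": 2487, "test": 2489}
total_images = sum(SPLITS.values())  # 19,551 images in total
```

The split counts sum to 19,551 images, which is a quick sanity check that the three reported subsets partition the full dataset.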