Convolutional Auto-encoding of Sentence Topics for Image Paragraph Generation

Authors: Jing Wang, Yingwei Pan, Ting Yao, Jinhui Tang, Tao Mei

IJCAI 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments are conducted on Stanford image paragraph dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, CAE-LSTM increases CIDEr performance from 20.93% to 25.15%." |
| Researcher Affiliation | Collaboration | Jing Wang¹, Yingwei Pan², Ting Yao², Jinhui Tang¹ and Tao Mei²; ¹School of Computer Science and Engineering, Nanjing University of Science and Technology, China; ²JD AI Research, Beijing, China |
| Pseudocode | No | The paper describes the architecture and processes of CAE and CAE-LSTM but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about releasing source code, nor a link to a code repository for the described methodology. |
| Open Datasets | Yes | "We conducted the experiments and evaluated our CAE-LSTM on Stanford image paragraph dataset (Stanford) [Krause et al., 2017], a benchmark in the field of image paragraph generation." |
| Dataset Splits | Yes | "In our experiments, we follow the widely used settings in [Krause et al., 2017] and take 14,575 images for training, 2,487 for validation and 2,489 for testing." |
| Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, or cloud instances) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components such as Faster R-CNN, VGG16, and LSTM, and refers to the Microsoft COCO Evaluation Server for metrics, but it does not give version numbers for any key software dependency (e.g., "Python 3.8", "PyTorch 1.9"). |
| Experiment Setup | Yes | Settings: "For each image, we apply Faster R-CNN to detect objects within this image and select top M = 50 regions with highest detection confidences to represent the image... The maximum sentence number K is 6 and the maximum word number in a sentence is 20 (padded where necessary). For our CAE, the convolutional filter size in the convolutional layer is set as C1 = 26 with stride C2 = 2. The dimensions of the embedded region-level feature and distilled topic vector are set as D1 = 1,024 and D2 = 500. For the two-level LSTM networks, the dimension of hidden state in each LSTM is H = 1,000. The dimension of the hidden layer for measuring attention distribution is D3 = 512." Implementation Details: "...we set the learning rate as 1×10^-4... For the second phase of self-critical training, the learning rate is set as 5×10^-6... The tradeoff parameter β is set as 8 according to the validation performance. Note that Batch normalization [Ioffe and Szegedy, 2015] and dropout [Srivastava et al., 2014] (dropout rate: 0.5) are applied in our experiments." |
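The hyperparameters and dataset splits reported in the table can be gathered into a single configuration sketch, which also checks that the reported splits sum to the 19,551 images of the Stanford dataset. This is purely illustrative: the paper releases no code, so every name below is an assumption, and only the numeric values come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CAELSTMConfig:
    """Hypothetical container for the settings quoted from the paper."""
    # Region features (Faster R-CNN detections per image)
    num_regions: int = 50            # M
    region_dim: int = 1024           # D1, embedded region-level feature
    topic_dim: int = 500             # D2, distilled topic vector
    # Paragraph shape
    max_sentences: int = 6           # K
    max_words_per_sentence: int = 20
    # CAE convolutional layer
    conv_filter_size: int = 26       # C1 (as reported)
    conv_stride: int = 2             # C2
    # Two-level LSTM decoder
    lstm_hidden_dim: int = 1000      # H
    attention_hidden_dim: int = 512  # D3
    # Optimization
    lr_cross_entropy: float = 1e-4   # first training phase
    lr_self_critical: float = 5e-6   # second (self-critical) phase
    beta: float = 8.0                # tradeoff parameter, chosen on validation
    dropout: float = 0.5


cfg = CAELSTMConfig()

# Dataset splits reported for the Stanford image paragraph dataset
splits = {"train": 14_575, "val": 2_487, "test": 2_489}
total_images = sum(splits.values())
print(total_images)  # 19551
```

A frozen dataclass is a deliberate choice here: it gives reproducibility reports an immutable, hashable record of the exact settings being assessed.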