Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention
Authors: Xin Hu, Lingling Zhang, Jun Liu, Xinyu Zhang, Wenjun Wu, Qianying Wang
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By conducting comprehensive experiments on diagrams and natural images, we demonstrate that the proposed model achieves superior performance over the competitors. We conduct experiments on a diagram dataset AI2D and a natural image dataset Flickr30K Entities. The experimental results indicate that the proposed GPA model achieves the best accuracy in the diagram visual grounding task, and also obtains a comparable accuracy over the competitors in natural images. |
| Researcher Affiliation | Collaboration | Xin Hu (1,2), Lingling Zhang (1,2), Jun Liu (1,2), Xinyu Zhang (1,2), Wenjun Wu (1,2), and Qianying Wang (3); (1) Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi'an Jiaotong University, China; (2) National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, China; (3) Lenovo Research, Beijing, China |
| Pseudocode | No | The paper describes the steps of the model but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it mention a specific repository link or explicit code release statement. |
| Open Datasets | Yes | We evaluate the GPA model on an AI2D dataset with diagrams and a Flickr30K Entities dataset with natural images. AI2D is a dataset focusing on the scientific topic of primary and secondary schools... We split AI2D into a train set with 1,634 diagrams and a test set with 404 diagrams. ...In addition to diagram visual grounding, the framework of GPA model is also applicable to processing natural images. To this end, we select a benchmark Flickr30k Entities [Plummer et al., 2015]. |
| Dataset Splits | Yes | We split AI2D into a train set with 1,634 diagrams and a test set with 404 diagrams. We follow the same split as in the previous works [Deng et al., 2021; Yang et al., 2022] for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments, only general training information. |
| Software Dependencies | No | Our model is implemented using PyTorch. ...linguistic embedding is initialized with BERT. The paper mentions PyTorch and BERT but does not provide specific version numbers for these software dependencies. (A hedged sketch of this BERT initialization follows the table.) |
| Experiment Setup | Yes | For fair comparison, we resize the visual input into 640 × 640 × 3 and follow the previous works [Deng et al., 2021; Yang et al., 2022] to perform data augmentation. The maximum length of the language expression is set to 40. ...When training the GPA model, we use Adam for parameter optimization with an initial learning rate of 10⁻⁴. We set the learning rate for the visual backbone network and linguistic BERT to 10⁻⁵ and the weight decay is 10⁻⁴. For comparison with the baseline models, we extend the training epochs to 90, and decay the learning rate by 10 after 60 epochs. In Eq. (10), we set 0.5 as the initial value of α, β, and γ. To avoid overfitting, we exploit dropout operation after the multi-head attention layer and the dropout rate is set to 0.1 by default. The evaluation of GPA model is in the same way as VLTVG [Yang et al., 2022] and we set λ_giou = 2 and λ_L1 = 5. (A hedged PyTorch sketch of this training recipe appears after the table.) |
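
The Software Dependencies row notes that the linguistic embedding is initialized with BERT under PyTorch, with no library or version details given. The sketch below shows one plausible way to reproduce that initialization with Hugging Face `transformers`; the `bert-base-uncased` checkpoint and the example query string are assumptions, not details from the paper.

```python
# A minimal sketch of initializing the linguistic embedding from pretrained
# BERT, as the paper states. The "bert-base-uncased" checkpoint is an
# assumption; the paper specifies neither the BERT variant nor any versions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# The paper caps language expressions at 40 tokens.
tokens = tokenizer(
    "the stage between egg and pupa",  # hypothetical diagram query
    padding="max_length", max_length=40, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    linguistic_embedding = bert(**tokens).last_hidden_state  # shape (1, 40, 768)
```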
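
The Experiment Setup row reports the full optimization recipe, but the paper releases no code, so the following is a minimal PyTorch sketch of that recipe rather than the authors' implementation. The `GPAStub` module, its submodule names, and the exact GIoU + L1 form of the box loss are assumptions; the learning rates, weight decay, schedule, and the λ_giou = 2, λ_L1 = 5 weights come straight from the quoted setup.

```python
# Hedged sketch of the reported training configuration: Adam with a base
# learning rate of 1e-4, 1e-5 for the visual backbone and BERT, weight decay
# 1e-4, 90 epochs with the learning rate decayed by 10x after epoch 60, and a
# box loss weighted with lambda_giou = 2 and lambda_l1 = 5 (as in VLTVG).
import torch
import torchvision


class GPAStub(torch.nn.Module):
    """Placeholder exposing the three parameter groups the paper distinguishes."""

    def __init__(self):
        super().__init__()
        self.visual_backbone = torch.nn.Linear(8, 8)  # stands in for the CNN backbone
        self.bert = torch.nn.Linear(8, 8)             # stands in for linguistic BERT
        self.head = torch.nn.Linear(8, 8)             # stands in for the grounding head


model = GPAStub()
optimizer = torch.optim.Adam(
    [
        {"params": model.visual_backbone.parameters(), "lr": 1e-5},
        {"params": model.bert.parameters(), "lr": 1e-5},
        {"params": model.head.parameters()},  # uses the base lr below
    ],
    lr=1e-4,
    weight_decay=1e-4,
)
# Decay the learning rate by a factor of 10 after epoch 60 (90 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)


def box_loss(pred, target, lambda_giou=2.0, lambda_l1=5.0):
    """GIoU + L1 box regression loss with the reported weights.

    pred, target: (N, 4) boxes in (x1, y1, x2, y2) format. The GIoU + L1
    combination is an assumption based on the cited VLTVG-style evaluation.
    """
    giou = torchvision.ops.generalized_box_iou_loss(pred, target, reduction="mean")
    l1 = torch.nn.functional.l1_loss(pred, target, reduction="mean")
    return lambda_giou * giou + lambda_l1 * l1


for epoch in range(90):
    # ... one pass over the AI2D (or Flickr30K Entities) training split here ...
    scheduler.step()
```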