Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

Authors: Haowei Wang, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Xiaoshuai Sun

AAAI 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method. Compared to the single-stage baseline, our method achieves a significant improvement of up to 9.4% accuracy. More importantly, our EPNG is 10 times faster than the two-stage model." |
| Researcher Affiliation | Collaboration | Haowei Wang1*, Jiayi Ji1*, Yiyi Zhou1,2, Yongjian Wu4, Xiaoshuai Sun1,2,3 — 1Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, 361005, China; 2Institute of Artificial Intelligence, Xiamen University, China; 3Fujian Engineering Research Center of Trusted Artificial Intelligence Analysis and Application, Xiamen University, China; 4Tencent Youtu Lab, Shanghai, China |
| Pseudocode | No | The paper describes the architecture and components of EPNG but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "The source codes and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/EPNG.git." |
| Open Datasets | Yes | "We train and compare our model with the existing method on the Panoptic Narrative Grounding dataset (González et al. 2021). It consists of images and the corresponding text. ... The dataset includes a total of 133,103 training images and 8,380 test images with 875,073 and 56,531 segmentation annotations, respectively." |
| Dataset Splits | No | The paper reports 133,103 training images and 8,380 test images (875,073 and 56,531 segmentation annotations, respectively) but does not explicitly define a separate validation split. |
| Hardware Specification | Yes | "We train it on 4 RTX3090 GPUs, which cost 20 hours in total." |
| Software Dependencies | No | The paper mentions "ResNet-101" and "pre-trained BERT" as backbones but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch/TensorFlow, CUDA) used for the implementation. |
| Experiment Setup | Yes | "For parallel training, we increase the input image resolution to 640×640, so the shapes of the last three layers are 20×20×256, 40×40×256, and 80×80×256, respectively. Moreover, the dimension of text features is 768. The number of attention heads is 8 and the hidden dimension is 2048. Besides, the number of layers S is 3. In terms of hyperparameters, we use λ1 = 2, λ2 = 2 and λ3 = 1 to balance the final loss. We set the initial learning rate η = 1e-5, which is halved every 5 epochs, and fixed at η = 5e-7 after 10 epochs. The batch size is 32." |
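The learning-rate schedule quoted above (initial η = 1e-5, halved every 5 epochs, pinned to 5e-7 from epoch 10 onward) can be sketched as a small helper. This is only an illustrative reading of the prose, not code from the authors' repository; the function name, parameter names, and the exact epoch boundaries are assumptions.

```python
def epng_lr(epoch: int,
            base_lr: float = 1e-5,
            decay_every: int = 5,
            floor_after: int = 10,
            floor_lr: float = 5e-7) -> float:
    """Learning rate for a given (0-indexed) epoch, per the reported
    schedule: base_lr is halved every `decay_every` epochs, then held
    at `floor_lr` once `epoch` reaches `floor_after`.

    All names and the 0-indexed epoch convention are assumptions made
    for this sketch.
    """
    if epoch >= floor_after:
        return floor_lr
    return base_lr * (0.5 ** (epoch // decay_every))
```

In a PyTorch training loop this could be wired up via `torch.optim.lr_scheduler.LambdaLR` (passing `lambda e: epng_lr(e) / 1e-5` as the multiplicative factor), though the paper does not state how the schedule was implemented.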