Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

Authors: Haowei Wang, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Xiaoshuai Sun

AAAI 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method. Compared to the single-stage baseline, our method achieves a significant improvement of up to 9.4% accuracy. More importantly, our EPNG is 10 times faster than the two-stage model." |
| Researcher Affiliation | Collaboration | Haowei Wang1*, Jiayi Ji1*, Yiyi Zhou1,2, Yongjian Wu4, Xiaoshuai Sun1,2,3 — 1Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, 361005, China; 2Institute of Artificial Intelligence, Xiamen University, China; 3Fujian Engineering Research Center of Trusted Artificial Intelligence Analysis and Application, Xiamen University, China; 4Tencent Youtu Lab, Shanghai, China |
| Pseudocode | No | The paper describes the architecture and components of EPNG but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "The source codes and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/EPNG.git." |
| Open Datasets | Yes | "We train and compare our model with the existing method on the Panoptic Narrative Grounding dataset (González et al. 2021). It consists of images and the corresponding text. ... The dataset includes a total of 133,103 training images and 8,380 test images with 875,073 and 56,531 segmentation annotations, respectively." |
| Dataset Splits | No | The paper reports 133,103 training images and 8,380 test images (875,073 and 56,531 segmentation annotations, respectively) but does not explicitly define a separate validation split. |
| Hardware Specification | Yes | "We train it on 4 RTX3090 GPUs, which cost 20 hours in total." |
| Software Dependencies | No | The paper mentions "ResNet-101" and "pre-trained BERT" as backbones but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch/TensorFlow, CUDA) used for the implementation. |
| Experiment Setup | Yes | "For parallel training, we increase the input image resolution to 640×640, so the shapes of the last three layers are 20×20×256, 40×40×256, and 80×80×256, respectively. Moreover, the dimension of text features is 768. The number of attention heads is 8 and the hidden dimension is 2048. Besides, the number of layers S is 3. In terms of hyperparameters, we use λ1 = 2, λ2 = 2 and λ3 = 1 to balance the final loss. We set the initial learning rate η = 1e-5, which is halved every 5 epochs, and fixed at η = 5e-7 after 10 epochs. The batch size is 32." |
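The learning-rate schedule quoted above (initial η = 1e-5, halved every 5 epochs, pinned to 5e-7 from epoch 10 onward) can be sketched as a small helper. This is only an illustrative reading of the prose, not code from the authors' repository; the function name, parameter names, and the exact epoch boundaries are assumptions.

```python
def epng_lr(epoch: int,
            base_lr: float = 1e-5,
            decay_every: int = 5,
            floor_after: int = 10,
            floor_lr: float = 5e-7) -> float:
    """Learning rate for a given (0-indexed) epoch, per the reported
    schedule: base_lr is halved every `decay_every` epochs, then held
    at `floor_lr` once `epoch` reaches `floor_after`.

    All names and the 0-indexed epoch convention are assumptions made
    for this sketch.
    """
    if epoch >= floor_after:
        return floor_lr
    return base_lr * (0.5 ** (epoch // decay_every))
```

In a PyTorch training loop this could be wired up via `torch.optim.lr_scheduler.LambdaLR` (passing `lambda e: epng_lr(e) / 1e-5` as the multiplicative factor), though the paper does not state how the schedule was implemented.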