Disentangled Motif-aware Graph Learning for Phrase Grounding

Authors: Zongshen Mu, Siliang Tang, Jie Tan, Qiang Yu, Yueting Zhuang (pp. 13587-13594)

AAAI 2021

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | We validate the efficiency of the disentangled and interventional graph network (DIGN) through a series of ablation studies, and our model achieves state-of-the-art performance on the Flickr30K Entities and Refer It Game benchmarks.

Researcher Affiliation | Collaboration | Zongshen Mu (1), Siliang Tang (1), Jie Tan (1), Qiang Yu (2), Yueting Zhuang (1). (1) DCD Lab, College of Computer Science, Zhejiang University; (2) City Cloud Technology (China) Co., Ltd. {zongshen,siliang,tanjie95,yzhuang}@zju.edu.cn, yq@citycloud.com.cn

Pseudocode | Yes | The pseudocode of the interventional process is listed in Appendix B.

Open Source Code | No | The paper provides links to third-party tools it uses (e.g., https://pypi.org/project/pytorch-pretrained-bert/ for BERT and https://github.com/rowanz/neural-motifs for visual scene graph generation) but does not state that the source code for the proposed DIGN model itself is openly available.

Open Datasets | Yes | We validate our model on two common datasets for phrase grounding. Flickr30K Entities (Plummer et al. 2015) contains 31,783 images, where each image corresponds to five captions with annotated noun phrases. ... Refer It Game (Kazemzadeh et al. 2014) contains 20,000 images along with 99,535 segmented image regions.

Dataset Splits | Yes | We divide the dataset into 30k images for training, 1k for validation, and 1k for testing.

Hardware Specification | No | The paper does not provide any specific details about the hardware used for the experiments (e.g., GPU models, CPU types, or memory specifications).

Software Dependencies | No | The paper mentions specific tools such as pre-trained BERT (Devlin et al. 2019) and a Java toolkit for scene graph parsing, along with links to some repositories, but it does not give version numbers for the key software components (e.g., the PyTorch, Java, or Python versions).

Experiment Setup | Yes | All the dimensions d_in^T, d_in^V, d_out^T, and d_out^V are set to 512. For the phrase and visual disentangled graph networks, the number of neighbor routing layers is 2. ... We set K to 4. We use only one layer of Transformer with 4 attention heads as the cross-modal mapping. For the InfoNCE loss, the hyperparameter τ is set to 0.2. We train the end-to-end network with the SGD optimizer (learning rate 1e-3, weight decay 1e-4, momentum 0.9). The model is trained for 6 epochs with batch size 32.
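The setup quoted above specifies an InfoNCE loss with temperature τ = 0.2. A minimal sketch of that objective is given below; the batch-wise pairing scheme (matched phrase/region embeddings at the same index are positives, all other pairs are negatives) is an assumption for illustration, not a detail taken from the paper.

```python
import math

def info_nce_loss(queries, keys, tau=0.2):
    """InfoNCE over a batch of paired embeddings (plain-Python sketch).

    queries, keys: lists of equal-length vectors; the (i, i) pairs are
    treated as positives and every other pair as a negative. tau = 0.2
    follows the hyperparameter quoted from the paper.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(a):
        n = math.sqrt(dot(a, a))
        return [x / n for x in a]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]

    loss = 0.0
    for i, qi in enumerate(q):
        # Cosine-similarity logits against every key, scaled by 1/tau.
        logits = [dot(qi, kj) / tau for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive (i, i) pair.
        loss += log_z - logits[i]
    return loss / len(q)
```

In a full implementation, the optimizer settings quoted above (SGD with learning rate 1e-3, weight decay 1e-4, momentum 0.9) would be passed to the training loop's optimizer, e.g. torch.optim.SGD.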