Disentangled Motif-aware Graph Learning for Phrase Grounding

Authors: Zongshen Mu, Siliang Tang, Jie Tan, Qiang Yu, Yueting Zhuang (pp. 13587-13594)

AAAI 2021

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | We validate the efficiency of the disentangled and interventional graph network (DIGN) through a series of ablation studies, and our model achieves state-of-the-art performance on the Flickr30K Entities and Refer It Game benchmarks.

Researcher Affiliation | Collaboration | Zongshen Mu (1), Siliang Tang (1), Jie Tan (1), Qiang Yu (2), Yueting Zhuang (1). (1) DCD Lab, College of Computer Science, Zhejiang University; (2) City Cloud Technology (China) Co., Ltd. {zongshen,siliang,tanjie95,yzhuang}@zju.edu.cn, yq@citycloud.com.cn

Pseudocode | Yes | The pseudocode of the interventional process is listed in Appendix B.

Open Source Code | No | The paper provides links to third-party tools it uses (e.g., https://pypi.org/project/pytorch-pretrained-bert/ for BERT and https://github.com/rowanz/neural-motifs for visual scene graph generation) but does not state that the source code for the proposed DIGN model itself is openly available.

Open Datasets | Yes | We validate our model on two common datasets for phrase grounding. Flickr30K Entities (Plummer et al. 2015) contains 31,783 images, where each image corresponds to five captions with annotated noun phrases. ... Refer It Game (Kazemzadeh et al. 2014) contains 20,000 images along with 99,535 segmented image regions.

Dataset Splits | Yes | We divide the dataset into 30k images for training, 1k for validation, and 1k for testing.

Hardware Specification | No | The paper does not provide any specific details about the hardware used for the experiments (e.g., GPU models, CPU types, or memory specifications).

Software Dependencies | No | The paper mentions specific tools such as pre-trained BERT (Devlin et al. 2019) and a Java toolkit for scene graph parsing, along with links to some repositories, but it does not give version numbers for the key software components (e.g., the PyTorch, Java, or Python versions).

Experiment Setup | Yes | All the dimensions d_in^T, d_in^V, d_out^T, and d_out^V are set to 512. For the phrase and visual disentangled graph networks, the number of neighbor routing layers is 2. ... We set K to 4. We use only one layer of Transformer with 4 attention heads as the cross-modal mapping. For the InfoNCE loss, the hyperparameter τ is set to 0.2. We train the end-to-end network with the SGD optimizer (learning rate 1e-3, weight decay 1e-4, momentum 0.9). The model is trained for 6 epochs with batch size 32.
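The setup quoted above specifies an InfoNCE loss with temperature τ = 0.2. A minimal sketch of that objective is given below; the batch-wise pairing scheme (matched phrase/region embeddings at the same index are positives, all other pairs are negatives) is an assumption for illustration, not a detail taken from the paper.

```python
import math

def info_nce_loss(queries, keys, tau=0.2):
    """InfoNCE over a batch of paired embeddings (plain-Python sketch).

    queries, keys: lists of equal-length vectors; the (i, i) pairs are
    treated as positives and every other pair as a negative. tau = 0.2
    follows the hyperparameter quoted from the paper.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(a):
        n = math.sqrt(dot(a, a))
        return [x / n for x in a]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]

    loss = 0.0
    for i, qi in enumerate(q):
        # Cosine-similarity logits against every key, scaled by 1/tau.
        logits = [dot(qi, kj) / tau for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive (i, i) pair.
        loss += log_z - logits[i]
    return loss / len(q)
```

In a full implementation, the optimizer settings quoted above (SGD with learning rate 1e-3, weight decay 1e-4, momentum 0.9) would be passed to the training loop's optimizer, e.g. torch.optim.SGD.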