Joint Modeling of Visual Objects and Relations for Scene Graph Generation

Authors: Minghao Xu, Meng Qu, Bingbing Ni, Jian Tang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on both the relationship retrieval and zero-shot relationship retrieval tasks prove the efficiency and efficacy of our proposed approach.
Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University, Shanghai 200240, China; 2 Mila Québec AI Institute; 3 University of Montréal; 4 HEC Montréal; 5 CIFAR AI Research Chair. Emails: {xuminghao118, nibingbing}@sjtu.edu.cn, meng.qu@umontreal.ca, jian.tang@hec.ca
Pseudocode | Yes | Algorithm 1: Inference algorithm of JM-SGG.
Open Source Code | No | Our method is implemented under PyTorch [25], and the source code will be released for reproducibility.
Open Datasets | Yes | We use the Visual Genome (VG) dataset [16] (CC BY 4.0 License), a large-scale database with structured image concepts, for evaluation. We use the pre-processed VG from Xu et al. [48] (MIT License) which contains 108k images with 150 object categories and 50 relation types.
Dataset Splits | Yes | Following previous works [53, 36, 37], we employ the original split with 70% images for training and 30% images for test, and 5k images randomly sampled from the training split are held out for validation. (A data-split sketch follows the table.)
Hardware Specification | Yes | An NVIDIA Tesla V100 GPU is used for training.
Software Dependencies | No | Our method is implemented under PyTorch [25]. The paper mentions PyTorch but does not specify a version number or other software dependencies with version information.
Experiment Setup | Yes | In our experiments, the object detector is first pre-trained by an SGD optimizer (batch size: 4, initial learning rate: 0.001, momentum: 0.9, weight decay: 5 × 10^-4) for 20 epochs, and the learning rate is multiplied by 0.1 after the 10th epoch. During maximum likelihood learning, we train the potential functions and fine-tune the object detector with another SGD optimizer (batch size: 4, potential function learning rate: 0.001, detector learning rate: 0.0001, momentum: 0.9, weight decay: 5 × 10^-4) for 10 epochs, and the learning rate is multiplied by 0.1 after the 5th epoch. Unless otherwise specified, the iteration number N_T is set as 1 for training and 2 for test, and the per-image sampling size N_S is set as 3. (An optimizer-schedule sketch follows the table.)
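
The Dataset Splits row describes a 70%/30% train/test division of the pre-processed Visual Genome images, with 5k training images held out for validation. The snippet below is a minimal sketch of such a split; the function and variable names (make_vg_splits, image_ids) are illustrative and not taken from the paper's code.

```python
import random

def make_vg_splits(image_ids, val_size=5000, train_ratio=0.7, seed=0):
    """Shuffle VG image IDs and split them into train/val/test lists."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)

    # 70% of the images for training, the remaining 30% for test.
    n_train = int(train_ratio * len(ids))
    train_ids, test_ids = ids[:n_train], ids[n_train:]

    # Hold out 5k randomly sampled training images for validation.
    val_ids = train_ids[:val_size]
    train_ids = train_ids[val_size:]
    return train_ids, val_ids, test_ids

# Example with roughly the 108k images of the pre-processed VG:
train_ids, val_ids, test_ids = make_vg_splits(range(108000))
```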
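
The Experiment Setup row lists concrete SGD hyperparameters and learning-rate schedules for the two training stages. The following PyTorch sketch mirrors those settings under the assumption that the detector and potential functions are ordinary nn.Module objects; the stand-in modules and loop bodies are placeholders, not the actual JM-SGG implementation.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Stand-ins for the real JM-SGG modules (placeholders, not the paper's code).
detector = torch.nn.Linear(8, 8)
potential_functions = torch.nn.Linear(8, 8)

# Stage 1: detector pre-training for 20 epochs; the learning rate is
# multiplied by 0.1 after the 10th epoch (batch size 4 in the paper).
pretrain_opt = SGD(detector.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
pretrain_sched = MultiStepLR(pretrain_opt, milestones=[10], gamma=0.1)

# Stage 2: maximum likelihood learning for 10 epochs with separate learning
# rates for the potential functions (1e-3) and the detector (1e-4); the
# learning rate is multiplied by 0.1 after the 5th epoch.
joint_opt = SGD(
    [
        {"params": potential_functions.parameters(), "lr": 1e-3},
        {"params": detector.parameters(), "lr": 1e-4},
    ],
    momentum=0.9,
    weight_decay=5e-4,
)
joint_sched = MultiStepLR(joint_opt, milestones=[5], gamma=0.1)

for epoch in range(20):
    # ... one pre-training epoch over the VG training split goes here ...
    pretrain_opt.step()   # placeholder for the real per-batch updates
    pretrain_sched.step()

for epoch in range(10):
    # ... one maximum-likelihood epoch goes here; per the paper, N_T = 1
    # sampling iteration during training and N_S = 3 samples per image ...
    joint_opt.step()      # placeholder for the real per-batch updates
    joint_sched.step()
```

Per-parameter-group learning rates are used in the second stage so that a single optimizer can apply the lower fine-tuning rate to the detector while training the potential functions at the higher rate, matching the setup quoted above.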