Large-Scale Visual Relationship Understanding

Authors: Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, Mohamed Elhoseiny (pp. 9185-9194)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a new relationship detection model that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved. ... We demonstrate the efficacy of our model on a large and imbalanced benchmark based on Visual Genome that comprises 53,000+ objects and 29,000+ relations, a scale at which no previous work has been evaluated. We show superiority of our model over competitive baselines on the original Visual Genome dataset with 80,000+ categories. We also show state-of-the-art performance on the VRD dataset and the scene graph dataset which is a subset of Visual Genome with 200 categories.
Researcher Affiliation | Collaboration | Ji Zhang (1,2), Yannis Kalantidis (1), Marcus Rohrbach (1), Manohar Paluri (1), Ahmed Elgammal (2), Mohamed Elhoseiny (1); 1 Facebook Research, 2 Department of Computer Science, Rutgers University
Pseudocode | No | The paper describes the model and training process in text and diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | We will release the cleaned annotations along with our code.
Open Datasets | Yes | We present experiments on three datasets, the original Visual Genome (VG80k) (Krishna et al., 2017), the version of Visual Genome with 200 categories (VG200) (Xu et al., 2017), and Visual Relationship Detection (VRD) dataset (Lu et al., 2016).
Dataset Splits | Yes | We follow Johnson, Karpathy, and Fei-Fei (2016) and split the data into 103,077 training images and 5,000 testing images. ... We further split the training set into 97,961 training and 2,000 validation images.
Hardware Specification | No | For all the three datasets, we train our model for 7 epochs using 8 GPUs.
Software Dependencies | No | We initialize each branch with weights pre-trained on COCO (Lin et al. 2014). For the word vectors, we used the gensim library (Řehůřek and Sojka 2010) for both word2vec and node2vec (Grover and Leskovec 2016). (A minimal gensim usage sketch follows the table.)
Experiment Setup | Yes | For all the three datasets, we train our model for 7 epochs using 8 GPUs. We set the learning rate to 0.001 for the first 5 epochs and 0.0001 for the remaining 2 epochs. We initialize each branch with weights pre-trained on COCO (Lin et al. 2014). For the word vectors, we used the gensim library (Řehůřek and Sojka 2010) for both word2vec and node2vec (Grover and Leskovec 2016). For the triplet loss, we set m = 0.2 as the default value. ... We set the scalar to 3.2 for VG80k and 3.0 for VRD in all experiments. (A sketch of this schedule and loss follows the table.)
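
For the word-vector dependency noted above, the following is a minimal sketch (not the authors' released code) of how category embeddings could be obtained with gensim. The pretrained vector file path, the example category names, and the toy random walks are assumptions for illustration; node2vec is approximated here by running word2vec over graph walks, as in its reference implementation.

```python
# Minimal sketch (gensim >= 4.0) of obtaining word vectors for object and
# relation category names. File path and toy data are placeholders.
from gensim.models import KeyedVectors, Word2Vec

# Load pretrained word2vec embeddings (path is an assumption, e.g. the
# public GoogleNews vectors) and look up a category name.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
vec_person = w2v["person"]  # 300-d vector for the category "person"

# node2vec's reference implementation reduces to word2vec over random walks
# on a graph; `walks` below is a hypothetical list of node sequences.
walks = [["person", "wear", "shirt"], ["dog", "on", "sofa"]]
n2v = Word2Vec(sentences=walks, vector_size=300, window=5, min_count=0, sg=1)
vec_wear = n2v.wv["wear"]
```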
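
The experiment-setup row reports a 7-epoch schedule (learning rate 0.001 for the first 5 epochs, 0.0001 for the last 2) and a triplet loss with margin m = 0.2. Below is a minimal PyTorch sketch of that schedule and margin only; the linear stand-in model, random batches, SGD optimizer, and PyTorch's built-in TripletMarginLoss are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the reported 7-epoch schedule and a margin-0.2 triplet loss.
import torch
import torch.nn as nn

model = nn.Linear(300, 300)  # stand-in for an embedding branch
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# Dropping the learning rate 10x after epoch 5 gives 0.001 -> 0.0001.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5], gamma=0.1)
triplet_loss = nn.TripletMarginLoss(margin=0.2)  # m = 0.2 as in the setup row

for epoch in range(7):
    # Anchor/positive/negative embeddings would come from the visual and
    # semantic branches; random tensors keep the sketch self-contained.
    anchor, pos, neg = (torch.randn(32, 300) for _ in range(3))
    loss = triplet_loss(model(anchor), model(pos), model(neg))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```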