Improving Scene Graph Classification by Exploiting Knowledge from Texts

Authors: Sahand Sharifzadeh, Sina Moayed Baharlou, Martin Schmitt, Hinrich Schütze, Volker Tresp

AAAI 2022, pp. 2189–2197 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that by fine-tuning the classification pipeline with the extracted knowledge from texts, we can achieve 8x more accurate results in scene graph classification, 3x in object classification, and 1.5x in predicate classification, compared to the supervised baselines with only 1% of the annotated images. We evaluate our approach on the Visual Genome dataset.
Researcher Affiliation | Collaboration | Sahand Sharifzadeh (1*), Sina Moayed Baharlou (1*), Martin Schmitt (2), Hinrich Schütze (2), Volker Tresp (1,3). 1: Department of Informatics, LMU Munich, Germany; 2: Center for Information and Language Processing (CIS), LMU Munich, Germany; 3: Siemens AG, Munich, Germany
Pseudocode | Yes | Algorithm 1: Classify objects/predicates from images; Algorithm 2: Fine-tune the relational reasoning component from textual triples using a denoising auto-encoder paradigm. (A sketch of the denoising step follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We use the sanitized version [Xu et al. 2017] of the Visual Genome (VG) dataset [Krishna et al. 2017], including images and their annotations, i.e., bounding boxes, scene graphs, and scene descriptions.
Dataset Splits | No | The paper specifies 'training images' (1% or 10% of VG data) and 'test sets' but does not explicitly define a separate validation set with specific percentages or counts.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as particular GPU or CPU models.
Software Dependencies | No | The paper mentions models and architectures like 'ResNet-50', 'Graph Transformer layers', and 'T5-small model' but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | To this end, we assume only a random proportion (1% or 10%) of training images are annotated (parallel set containing IM with corresponding SG and TXT). We consider the remaining data (99% or 90%) as our text set and discard their IM and SG. We use four different random splits [Sharifzadeh, Baharlou, and Tresp 2021] to avoid a sampling bias. We fine-tune the pre-trained T5 model on parallel TXT and SG. Randomly set 20% of the nodes and edges in E to zero. (A split-construction sketch follows the denoising sketch below.)
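
The paper's pseudocode is not reproduced in this report. Below is a minimal, hypothetical PyTorch sketch of the denoising auto-encoder fine-tuning that Algorithm 2 and the setup quote describe: embed a batch of (subject, predicate, object) class ids from textual triples, zero out a random 20% of the embeddings, and train a transformer-based relational reasoner to reconstruct the original classes. All module, function, and parameter names (RelationalReasoner, finetune_step, mask_p) are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalReasoner(nn.Module):
    # Stand-in for the paper's graph-transformer-based reasoning component.
    def __init__(self, n_classes, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_classes, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, class_ids, mask_p=0.2):
        e = self.embed(class_ids)  # (batch, seq, dim)
        # Denoising corruption: zero a random 20% of embeddings, matching
        # "Randomly set 20% of the nodes and edges in E to zero".
        keep = (torch.rand(e.shape[:2], device=e.device) > mask_p).float()
        e = e * keep.unsqueeze(-1)
        return self.classifier(self.encoder(e))  # reconstructed class logits

def finetune_step(model, optimizer, triples):
    # One auto-encoding step on a batch of textual triples of shape (batch, 3),
    # holding class indices for (subject, predicate, object).
    logits = model(triples)
    loss = F.cross_entropy(logits.flatten(0, 1), triples.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Training the reasoner to undo this corruption is what lets knowledge extracted from text feed back into the classification pipeline.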
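
For the split protocol quoted in the Experiment Setup row, here is a plain sketch of how 1% (or 10%) parallel/text partitions over four random seeds could be drawn. The seed values, dataset size, and the helper name make_split are assumptions for illustration, not details from the paper.

import random

def make_split(image_ids, fraction, seed):
    # Sample `fraction` of images as the annotated "parallel" set (IM + SG + TXT);
    # the remainder becomes the text-only set, whose IM and SG are discarded.
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * fraction)
    return ids[:cut], ids[cut:]

# Four different random splits to avoid a sampling bias, as the paper reports.
splits = [make_split(range(100_000), fraction=0.01, seed=s) for s in range(4)]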