Improving Scene Graph Classification by Exploiting Knowledge from Texts
Authors: Sahand Sharifzadeh, Sina Moayed Baharlou, Martin Schmitt, Hinrich Schütze, Volker Tresp (pp. 2189–2197)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that by fine-tuning the classification pipeline with the extracted knowledge from texts, we can achieve 8x more accurate results in scene graph classification, 3x in object classification, and 1.5x in predicate classification, compared to the supervised baselines with only 1% of the annotated images. We evaluate our approach on the Visual Genome dataset. |
| Researcher Affiliation | Collaboration | Sahand Sharifzadeh 1*, Sina Moayed Baharlou 1*, Martin Schmitt 2, Hinrich Schütze 2, Volker Tresp 1,3 — 1 Department of Informatics, LMU Munich, Germany; 2 Center for Information and Language Processing (CIS), LMU Munich, Germany; 3 Siemens AG, Munich, Germany |
| Pseudocode | Yes | Algorithm 1: Classify objects/predicates from images; Algorithm 2: Fine-tune the relational reasoning component from textual triples using a denoising auto-encoder paradigm |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use the sanitized version [Xu et al. 2017] of Visual Genome (VG) dataset [Krishna et al. 2017] including images and their annotations, i.e., bounding boxes, scene graphs, and scene descriptions. |
| Dataset Splits | No | The paper specifies 'training images' (1% or 10% of VG data) and 'test sets' but does not explicitly define a separate validation set with specific percentages or counts. |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as particular GPU or CPU models. |
| Software Dependencies | No | The paper mentions models and architectures like 'ResNet-50', 'Graph Transformer layers', and 'T5-small model' but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | To this end, we assume only a random proportion (1% or 10%) of training images are annotated (parallel set containing IM with corresponding SG and TXT). We consider the remaining data (99% or 90%) as our text set and discard their IM and SG. We use four different random splits [Sharifzadeh, Baharlou, and Tresp 2021] to avoid a sampling bias. We fine-tune the pre-trained T5 model on parallel TXT and SG. Randomly set 20% of the nodes and edges in E to zero. |
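The masking step quoted in the Experiment Setup row ("randomly set 20% of the nodes and edges in E to zero", used for the denoising auto-encoder fine-tuning of Algorithm 2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_graph`, the NumPy representation, and the tensor shapes are all assumptions.

```python
import numpy as np

def mask_graph(node_feats, edge_feats, mask_ratio=0.2, rng=None):
    """Zero out a random fraction of node and edge features.

    Hypothetical sketch of the denoising corruption step: the model is
    then trained to reconstruct the original (uncorrupted) graph.
    node_feats: (num_nodes, dim) array; edge_feats: (num_edges, dim) array.
    """
    rng = rng or np.random.default_rng()
    nodes = node_feats.copy()
    edges = edge_feats.copy()
    # Draw independent Bernoulli(mask_ratio) masks for nodes and edges.
    node_mask = rng.random(len(nodes)) < mask_ratio
    edge_mask = rng.random(len(edges)) < mask_ratio
    nodes[node_mask] = 0.0
    edges[edge_mask] = 0.0
    return nodes, edges, node_mask, edge_mask
```

During training, the masked graph would be fed to the relational reasoning component and a reconstruction loss computed only on the masked positions, following the standard denoising auto-encoder paradigm.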