Iterative Scene Graph Generation
Authors: Siddhesh Khandelwal, Leonid Sigal
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on Visual Genome [30] and Action Genome [25] benchmark datasets we show improved performance on the scene graph generation task. |
| Researcher Affiliation | Academia | Siddhesh Khandelwal (1,2) and Leonid Sigal (1,2,3). 1: Department of Computer Science, University of British Columbia; 2: Vector Institute for AI; 3: CIFAR AI Chair. {skhandel, lsigal}@cs.ubc.ca |
| Pseudocode | No | The paper describes the architecture and formulation but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | The code is available at github.com/ubc-vision/IterativeSG. |
| Open Datasets | Yes | Through extensive experiments on Visual Genome [30] and Action Genome [25] benchmark datasets we show improved performance on the scene graph generation task. Visual Genome [30] is licensed under the Creative Commons Attribution 4.0 International License. Action Genome [25] is licensed under the MIT license. |
| Dataset Splits | No | We use widely used data splits for our experiments. We briefly describe this in Section 5. The hyperparameters and additional data details are also mentioned in the supplementary (Section B). |
| Hardware Specification | No | Hardware resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. Additional support was provided by JELF CFI grant and Compute Canada under the RAC award. |
| Software Dependencies | No | We use ResNet-101 [22] as the backbone network for image feature extraction. |
| Experiment Setup | Yes | Implementation Details (transformer-based approach). We use ResNet-101 [22] as the backbone network for image feature extraction. Each of the subject, object, and predicate decoders has 6 layers, with a feature size of 256. The decoders use 300 queries. For training we use a batch size of 12 and an initial learning rate of 10^-4, which is gradually decayed. |
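
The quoted hyperparameters map onto a small stack of transformer decoders. Below is a minimal PyTorch sketch of that configuration; the module layout, attention-head count, conditioning order between decoders, optimizer, and decay schedule are assumptions for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the decoder configuration quoted above. Feature size (256),
# number of layers (6), number of queries (300), batch size (12), and initial
# learning rate (1e-4) come from the paper; everything else is assumed.
import torch
import torch.nn as nn

D_MODEL = 256      # decoder feature size (from the paper)
NUM_LAYERS = 6     # layers per decoder (from the paper)
NUM_QUERIES = 300  # learned queries (from the paper)


def make_decoder() -> nn.TransformerDecoder:
    # nhead=8 is an assumption; the paper does not state it in this excerpt.
    layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=NUM_LAYERS)


class IterativeSGDecoders(nn.Module):
    """Subject, object, and predicate decoders over shared image features."""

    def __init__(self) -> None:
        super().__init__()
        self.queries = nn.Embedding(NUM_QUERIES, D_MODEL)  # learned query embeddings
        self.subject_decoder = make_decoder()
        self.object_decoder = make_decoder()
        self.predicate_decoder = make_decoder()

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_tokens, D_MODEL), e.g. flattened
        # ResNet-101 backbone features projected to the decoder width.
        batch = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(batch, -1, -1)
        s = self.subject_decoder(q, image_features)
        o = self.object_decoder(s, image_features)       # chaining order is illustrative
        p = self.predicate_decoder(o, image_features)
        return p


model = IterativeSGDecoders()
# Batch size 12 and initial learning rate 1e-4 with gradual decay, per the paper;
# the AdamW choice and step schedule here are assumptions.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```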