Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning
Authors: Jin Chen, Xiaofeng Ji, Xinxiao Wu
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiment results on two benchmark video datasets demonstrate the effectiveness of our method." "Extensive experiments on the benchmark dataset have validated the effectiveness of our method." |
| Researcher Affiliation | Academia | Jin Chen, Xiaofeng Ji, Xinxiao Wu* Beijing Laboratory of Intelligent Information Technology School of Computer Science, Beijing Institute of Technology, Beijing, China {chenjin, jixf, wuxinxiao}@bit.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available or provide a link to it. |
| Open Datasets | Yes | To evaluate the proposed method, we conduct experiments on two video benchmark datasets, i.e., the VidVRD dataset (Shang et al. 2017) and the VidOR dataset (Shang et al. 2019). With the VidVRD dataset as the target domain, we use the VRD dataset (Lu et al. 2016) as the source image domain. With the VidOR dataset as the target video domain, we use the VG dataset (Zhang et al. 2017) as the source image domain. |
| Dataset Splits | No | The paper mentions training data and that target video annotations are only used for evaluation, but it does not provide specific numerical percentages or counts for training, validation, and test splits, nor does it refer to specific predefined splits by name. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using Faster R-CNN and ResNet101, but it does not specify version numbers for these or any other key software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | The shorter side of images and video frames is resized to 600 while preserving the aspect ratio. The dimension of the second-order statistic descriptor is set to 512 and the hyperparameter r in the factorized bilinear pooling is set to 5. The image domain classifier Dimg and the instance domain classifier Dins are designed using five fully-connected layers (1024→512→256→128→1) and three convolution layers (512→128→1), respectively. The visual mapping φ and the language mapping ϕ consist of three fully-connected layers (256→256→300) and two fully-connected layers (1024→300), respectively. During testing, non-maximum suppression with an IoU threshold of 0.3 is used to select boxes from object proposals, and the selected boxes with a confidence score greater than 0.5 are taken as the final detected objects for predicting relationships. |
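
Since no source code is released, the following is a minimal PyTorch sketch of the setup described in the Experiment Setup row, for illustration only. The paper specifies only the layer widths, the NMS IoU threshold (0.3), and the confidence threshold (0.5); everything else here (ReLU activations, 1×1 convolution kernels, and the `select_boxes` helper name) is an assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import nms


class ImageDomainClassifier(nn.Module):
    """Dimg: five fully-connected layers (1024->512->256->128->1).

    ReLU activations between layers are an assumption; the paper
    only lists the layer widths.
    """

    def __init__(self):
        super().__init__()
        dims = [1024, 512, 256, 128, 1]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:  # no activation after the final logit
                layers.append(nn.ReLU(inplace=True))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # one domain logit per image-level feature


class InstanceDomainClassifier(nn.Module):
    """Dins: three convolution layers (512->128->1).

    1x1 kernels are an assumption; the paper does not state kernel sizes.
    """

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),
        )

    def forward(self, x):
        return self.net(x)  # per-location domain logits


def select_boxes(boxes, scores, iou_thresh=0.3, score_thresh=0.5):
    """Test-time box selection as described: NMS at IoU 0.3, then keep
    boxes whose confidence exceeds 0.5."""
    keep = nms(boxes, scores, iou_thresh)
    keep = keep[scores[keep] > score_thresh]
    return boxes[keep], scores[keep]
```

The two thresholds in `select_boxes` come directly from the quoted setup; the classifier sketches only reproduce the stated layer widths, so they should be read as a plausible reconstruction rather than the method itself.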