Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding
Authors: Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Jing Yuan
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on a large-scale spatio-temporal video grounding dataset VidSTG [Zhang et al., 2020]. The overall experiment results are shown in Table 1 and we can find some interesting points: On the whole, the grounding performance of all models for interrogative sentences is lower than for declarative sentences, validating that unknown objects without explicit characteristics are more difficult to ground. For temporal grounding, the region-level methods STGRN and OMRN outperform the frame-level methods TALL and L-Net, which demonstrates that fine-grained region modeling is beneficial for determining accurate temporal boundaries of target tubes. For spatio-temporal grounding, the GroundeR+{} approaches ignore the temporal dynamics of objects and achieve the worst performance, suggesting it is crucial to capture object dynamics among frames for high-quality spatio-temporal video grounding. On all criteria, our OMRN achieves remarkable performance improvements compared with the baselines. This shows our method can effectively focus on the notable regions by object-aware multi-branch region modeling with the diversity loss and capture critical object relations by multi-branch reasoning. Furthermore, given the ground-truth temporal segment during inference, we compare the spatial grounding ability of our OMRN method with the baselines. The results are shown in Table 2; we do not separate declarative and interrogative sentences here. Our OMRN still achieves an apparent performance improvement on all criteria, especially for vIoU@0.3, demonstrating that our OMRN approach remains effective when applied to aligned segment-sentence data. We next verify the contribution of each part of our method by an ablation study, removing one key component at a time to generate an ablation model. The object-aware multi-branch modeling is vital in our method, so we first remove the object-aware modulation from each branch as w/o. OM. We then discard the diversity loss from the multi-task loss, denoted by w/o. DL. Further, we remove the cross-modal matching from all branches and discard the weighting terms $\hat{d}^n_{1k}$ and $\hat{d}^n_{tl}$ in multi-branch relation reasoning, denoted by w/o. CM; note that in this ablation model the diversity loss is also ineffective due to the lack of matching score distributions. Next, we conduct the ablation study on the basic region and object modeling: we discard the temporal region aggregation from region modeling as w/o. TA and remove the context attention during object extraction as w/o. CA. The ablation results are shown in Table 3. All ablation models show performance degradation compared with the full model, showing that each of the above components helps to improve grounding accuracy. The ablation models w/o. OM, w/o. DL and w/o. CM have lower accuracy than w/o. TA and w/o. CA, which suggests the object-aware multi-branch relation reasoning plays a crucial role in high-quality spatio-temporal grounding. Moreover, the model w/o. CM achieves the worst performance, validating that cross-modal matching with the diversity regularization is very important for incorporating language-relevant region features from auxiliary branches into the main branch. To qualitatively validate the effectiveness of our OMRN method, we display a typical example in Figure 3. The sentence describes a short-term state of the dog and requires capturing object-aware fine-grained relations. By intuitive comparison, our OMRN retrieves more accurate temporal boundaries and a more accurate spatio-temporal tube of the dog than the best baseline STGRN. Furthermore, we display the object-region matching score distribution in the example, where we visualize the matching scores between three objects (i.e., dog, rope and man) and the regions of the 4th frame. Although there are a woman and another dog in the frame, our method can still eliminate the interference and focus on the notable region containing the corresponding object, e.g., the object dog assigns a higher score to the 2nd region rather than the 6th region. (See the hedged matching/diversity-loss sketch after the table.) |
| Researcher Affiliation | Collaboration | Zhu Zhang¹, Zhou Zhao¹, Zhijie Lin¹, Baoxing Huai² and Jing Yuan²; ¹College of Computer Science, Zhejiang University, China; ²Huawei Cloud & AI, China |
| Pseudocode | No | The paper describes the proposed method using text and mathematical equations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We conduct experiments on a large-scale spatio-temporal video grounding dataset VidSTG [Zhang et al., 2020], which is constructed from the video object relation dataset VidOR [Shang et al., 2019] by annotating the natural language descriptions. |
| Dataset Splits | Yes | VidSTG contains 5,563, 618 and 743 videos in the training, validation and testing sets, totaling 6,924 videos. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a 'pre-trained GloVe embedding', 'NLTK', and an 'Adam optimizer' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | As for the modeling setting, we set α to 0.6, L to 5, and set λ1, λ2, λ3 and λ4 to 1.0, 1.0, 0.001 and 1.0, respectively. We define H = 9 candidate segments at each step with temporal widths [3, 9, 17, 33, 65, 97, 129, 165, 197]. We set the dimensions of all projection matrices and biases to 256 and set the hidden state of each direction in the BiGRU to 128. We employ an Adam optimizer with an initial learning rate of 0.0005. (See the configuration sketch after the table.) |
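
The ablation discussion above centers on the cross-modal matching between objects and regions and the diversity loss over the resulting matching score distributions. The paper excerpt does not give the exact formulation, so the following is only a minimal sketch under our own assumptions: matching scores are taken as a softmax over regions of a scaled dot product between projected object and region features, and the diversity term penalizes overlap between different objects' score distributions so that each object focuses on distinct regions. Function names, projection layers and the precise loss form are hypothetical, not the authors' released code.

```python
# Hypothetical sketch of object-region matching + diversity regularization
# (assumed form; not the authors' implementation).
import torch
import torch.nn.functional as F

def object_region_matching(objects, regions, proj_o, proj_r):
    """objects: (K, d_o) object embeddings; regions: (N, d_r) region features.
    Returns a (K, N) matrix of matching score distributions (softmax over regions)."""
    q = proj_o(objects)                          # (K, d) projected objects
    k = proj_r(regions)                          # (N, d) projected regions
    logits = q @ k.t() / q.size(-1) ** 0.5       # scaled dot-product similarity
    return F.softmax(logits, dim=-1)             # each object attends over regions

def diversity_loss(scores):
    """Assumed diversity term: penalize overlap between different objects'
    score distributions so each object attends to distinct regions."""
    overlap = scores @ scores.t()                        # (K, K) pairwise overlap
    off_diag = overlap - torch.diag(torch.diag(overlap)) # drop self-overlap
    k = scores.size(0)
    return off_diag.sum() / (k * max(k - 1, 1))

# Toy usage with random features and linear projections.
objects = torch.randn(3, 256)                # e.g. dog, rope, man
regions = torch.randn(6, 512)                # candidate regions of one frame
proj_o = torch.nn.Linear(256, 256)
proj_r = torch.nn.Linear(512, 256)
scores = object_region_matching(objects, regions, proj_o, proj_r)
loss_div = diversity_loss(scores)            # added to the multi-task loss
```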
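
The hyperparameters in the Experiment Setup row can be collected into a single configuration for re-implementation attempts. The numbers below are exactly those reported in the paper; the variable names, the interpretation comments, and the PyTorch optimizer wiring are our assumptions rather than the authors' code.

```python
# Configuration sketch assembled from the reported experiment setup
# (values from the paper; names and wiring are assumptions).
import torch

config = {
    "alpha": 0.6,                  # α in the modeling setting
    "L": 5,                        # L in the modeling setting (role not detailed in this excerpt)
    "loss_weights": {"lambda1": 1.0, "lambda2": 1.0, "lambda3": 0.001, "lambda4": 1.0},
    "num_candidate_segments": 9,   # H candidate segments at each step
    "segment_widths": [3, 9, 17, 33, 65, 97, 129, 165, 197],
    "hidden_dim": 256,             # dimension of all projection matrices and biases
    "bigru_hidden": 128,           # hidden state per direction in the BiGRU
    "learning_rate": 5e-4,         # Adam initial learning rate
}

# Optimizer as described: Adam with the reported initial learning rate.
model = torch.nn.Linear(config["hidden_dim"], config["hidden_dim"])  # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```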