Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

Authors: Xin Gu, Yaojie Shen, Chenxi Luo, Tiejian Luo, Yan Huang, Yuewei Lin, Heng Fan, Libo Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, TA-STVG achieves state-of-the-art performance and significantly outperforms the baseline, validating its efficacy. Moreover, TTS and ASA are designed for general purpose. When applied to existing methods such as TubeDETR and STCAT, we show substantial performance gains, verifying their generality.
Researcher Affiliation Academia 1 University of Chinese Academy of Sciences; 2 Institute of Software, Chinese Academy of Sciences; 3 La Trobe University; 4 University of North Texas; 5 Brookhaven National Laboratory. EMAIL, EMAIL
Pseudocode No The paper describes methods through text and diagrams (Figure 3, Figure 4) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code is released at https://github.com/HengLan/TA-STVG.
Open Datasets Yes We use three datasets, i.e., HCSTVG-v1/v2 (Tang et al., 2021) and VidSTG (Zhang et al., 2020b), for experiments.
Dataset Splits Yes HCSTVG-v1 contains 5,660 untrimmed videos, with 4,500 and 1,160 video-text pairs in training and testing sets. HCSTVG-v2 expands upon HCSTVG-v1, and comprises 10,131 training, 2,000 validation, and 4,413 testing samples. ... VidSTG ... with training, validation, and test sets containing 80,684, 8,956, and 10,303 sentences, and 5,436, 602, and 732 videos, respectively.
Hardware Specification Yes The inference is conducted on a single A100 GPU, and inference time refers to the duration of a single forward propagation. (Table 13: Comparison on model efficacy and complexity on VidSTG.)
Software Dependencies No TA-STVG is implemented in Python using PyTorch (Paszke et al., 2019). Similar to (Gu et al., 2024a), we use pre-trained ResNet-101 (He et al., 2016) and RoBERTa-base (Liu et al., 2019) from MDETR (Kamath et al., 2021) as 2D and text backbones, and VidSwin-tiny (Liu et al., 2022) as 3D backbone. The version numbers for these software components are not provided.
Experiment Setup Yes The number of attention heads is 8, and the hidden dimension of the encoder and decoder is 256. The channel dimensions Ca, Cm, Ct, D are 2,048, 768, 768 and 256. δ and θ are empirically set to 0.5 and 0.7. We use random resized cropping as the augmentation method, producing an output with a short side of 420. The video frame length Nv is 64, and the text sequence length Nt is 30. During training, we use Adam (Kingma & Ba, 2015) with an initial learning rate of 2e-5 for the pre-trained backbone and 3e-4 for the remaining modules (note that the 3D backbone is frozen). The loss weight parameters λTTS, λASA, λk, λl, and λu are set to 1, 1, 10, 5, and 3, respectively.
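For readers attempting a reproduction, the hyperparameters quoted above can be gathered into a single configuration sketch. This is a minimal illustration assembled from the reported values only; the key names are hypothetical and do not come from the authors' released code.

```python
# Hypothetical configuration sketch of the TA-STVG training setup,
# collecting the hyperparameters reported in the paper's experiment
# setup. Key names are illustrative, not taken from the official repo.
config = {
    "attention_heads": 8,
    "hidden_dim": 256,                 # encoder/decoder hidden dimension
    "channel_dims": {"Ca": 2048, "Cm": 768, "Ct": 768, "D": 256},
    "delta": 0.5,                      # empirically set threshold δ
    "theta": 0.7,                      # empirically set threshold θ
    "crop_short_side": 420,            # random resized cropping output
    "num_frames": 64,                  # video frame length Nv
    "text_len": 30,                    # text sequence length Nt
    "optimizer": "Adam",
    "lr_pretrained_backbone": 2e-5,    # 2D/text backbones; 3D backbone frozen
    "lr_remaining_modules": 3e-4,
    "loss_weights": {"TTS": 1, "ASA": 1, "k": 10, "l": 5, "u": 3},
}
```

In a PyTorch implementation, the two learning rates would typically be realized as separate parameter groups passed to the optimizer, with the 3D backbone's parameters excluded (frozen) entirely.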