Temporal Sentence Grounding with Relevance Feedback in Videos

Authors: Jianfeng Dong, Xiaoman Peng, Daizong Liu, Xiaoye Qu, Xun Yang, Cuizhu Bao, Meng Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our RaTSG network, we reconstruct two popular TSG datasets, establishing a rigorous benchmark for TSG-RF. Experimental results demonstrate the effectiveness of our proposed RaTSG for the TSG-RF task.
Researcher Affiliation | Academia | Jianfeng Dong (1,2), Xiaoman Peng (1), Daizong Liu (3), Xiaoye Qu (4), Xun Yang (5), Cuizhu Bao (1,2), Meng Wang (6). Affiliations: (1) Zhejiang Gongshang University, (2) Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, (3) Peking University, (4) Huazhong University of Science and Technology, (5) University of Science and Technology of China, (6) Hefei University of Technology.
Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or algorithm blocks with formal labels.
Open Source Code | Yes | Our source code is available at https://github.com/HuiGuanLab/RaTSG.
Open Datasets | Yes | We reconstruct two popular TSG datasets, establishing a rigorous benchmark for TSG-RF... The Charades-STA [22] dataset comprises 6,672 videos... The ActivityNet Captions [23] dataset consists of approximately 20,000 videos... For ease of reference, we name the corresponding reconstructed datasets as Charades-RF and ActivityNet-RF, respectively.
Dataset Splits | Yes | For each sample in both validation and test sets, we add a corresponding sample without grounding result, resulting in a 1:1 ratio of samples with and without grounding results. ... the training set contains 12,408 video-text sample pairs, while the test set comprises 3,720 pairs. Since the original Charades-STA does not have a validation set, we randomly halve the original test samples to form a validation set and a test set. ... the training set includes 37,421 video-text pairs, while the validation and test sets contain 17,505 and 17,031 samples, respectively. After reconstruction, the validation and test sets were augmented with an equal number of samples without grounding results, doubling the total number of sample pairs to 35,010 and 34,062, respectively.
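The split reconstruction quoted above can be illustrated with a minimal sketch. This is not the authors' released code (see the repository linked under Open Source Code); the field names are hypothetical, and the choice of a random mismatched video for the "no grounding" samples is an assumption, since the paper states that pairing explicitly only for training batches.

```python
import random

def add_no_grounding_samples(samples, seed=0):
    """Augment a validation/test split so samples with and without grounding
    results appear in a 1:1 ratio. Assumption: a 'no grounding' sample keeps
    the query but pairs it with a different, randomly chosen video."""
    rng = random.Random(seed)
    video_ids = [s["video_id"] for s in samples]
    augmented = list(samples)
    for s in samples:
        other = rng.choice([v for v in video_ids if v != s["video_id"]])
        augmented.append({"video_id": other, "query": s["query"], "segment": None})
    return augmented

def halve_test_set(test_samples, seed=0):
    """Randomly split the original Charades-STA test annotations in half to
    form validation and test sets, as described in the quote above."""
    rng = random.Random(seed)
    shuffled = list(test_samples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]
```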
Hardware Specification | Yes | All experiments are conducted on a workstation with an NVIDIA GeForce RTX 3090 Ti GPU and 256 GB RAM.
Software Dependencies | No | The paper mentions using GloVe 300d for word initialization, an I3D network for visual features, and a BERT model [62] for feature extraction, but it does not specify version numbers for these software components or any other libraries/frameworks.
Experiment Setup | Yes | Each word in the query text is initialized using GloVe 300d, which remains frozen during training. Visual features of videos are extracted using a pre-trained I3D network. The maximum video feature sequence length is set to 128. Sequences longer than this are uniformly downsampled to 128, while shorter sequences are zero-padded to the same length. During training, since the training set was not reconstructed and only contains samples with grounding results, each batch includes randomly selected videos paired with the original query text to create samples without grounding results. In Equation 3, we empirically set β = 6 and γ = 6 to balance all loss functions at the start of training. For the threshold m in Equation 4, the value providing the highest accuracy on the validation set is chosen: 0.5 for Charades-RF and 0.3 for ActivityNet-RF.
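For concreteness, the sequence-length handling and in-batch negative construction quoted above could look roughly like the PyTorch sketch below. This is not the authors' implementation; the function names are hypothetical, and rolling the batch by one position is only a simple stand-in for "randomly selected videos paired with the original query text".

```python
import torch

MAX_LEN = 128  # maximum video feature sequence length reported above

def normalize_length(feats: torch.Tensor, max_len: int = MAX_LEN):
    """feats: (T, D) I3D features for one video. Returns features of shape
    (max_len, D) plus a boolean mask marking valid (non-padded) positions."""
    t, d = feats.shape
    if t > max_len:
        # Uniformly downsample longer sequences to max_len frames.
        idx = torch.linspace(0, t - 1, steps=max_len).long()
        out, mask = feats[idx], torch.ones(max_len, dtype=torch.bool)
    else:
        # Zero-pad shorter sequences up to max_len frames.
        out = torch.zeros(max_len, d, dtype=feats.dtype)
        out[:t] = feats
        mask = torch.zeros(max_len, dtype=torch.bool)
        mask[:t] = True
    return out, mask

def make_in_batch_negatives(video_feats: torch.Tensor):
    """video_feats: (B, max_len, D). Shift the batch by one so each query is
    paired with a different video, yielding training samples without
    grounding results."""
    perm = torch.roll(torch.arange(video_feats.size(0)), shifts=1)
    return video_feats[perm]
```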