Temporal Sentence Grounding with Relevance Feedback in Videos
Authors: Jianfeng Dong, Xiaoman Peng, Daizong Liu, Xiaoye Qu, Xun Yang, Cuizhu Bao, Meng Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our RaTSG network, we reconstruct two popular TSG datasets, establishing a rigorous benchmark for TSG-RF. Experimental results demonstrate the effectiveness of our proposed RaTSG for the TSG-RF task. |
| Researcher Affiliation | Academia | Jianfeng Dong (1,2), Xiaoman Peng (1), Daizong Liu (3), Xiaoye Qu (4), Xun Yang (5), Cuizhu Bao (1,2), Meng Wang (6). Affiliations: 1 Zhejiang Gongshang University; 2 Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology; 3 Peking University; 4 Huazhong University of Science and Technology; 5 University of Science and Technology of China; 6 Hefei University of Technology |
| Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or algorithm blocks with formal labels. |
| Open Source Code | Yes | Our source code is available at https://github.com/HuiGuanLab/RaTSG. |
| Open Datasets | Yes | We reconstruct two popular TSG datasets, establishing a rigorous benchmark for TSG-RF... The Charades-STA [22] dataset comprises 6,672 videos... The ActivityNet Captions [23] dataset consists of approximately 20,000 videos... For ease of reference, we name the corresponding reconstructed datasets as Charades-RF and ActivityNet-RF, respectively. |
| Dataset Splits | Yes | For each sample in both validation and test sets, we add a corresponding sample without grounding result, resulting in a 1:1 ratio of samples with and without grounding results. ... the training set contains 12,408 video-text sample pairs, while the test set comprises 3,720 pairs. Since the original Charades-STA does not have a validation set, we randomly halve the original test samples to form a validation set and a test set. ... the training set includes 37,421 video-text pairs, while the validation and test sets contain 17,505 and 17,031 samples, respectively. After reconstruction, the validation and test sets are augmented with an equal number of samples without grounding results, doubling the total number of sample pairs to 35,010 and 34,062, respectively. (A sketch of this reconstruction appears below the table.) |
| Hardware Specification | Yes | All experiments are conducted on a workstation with an NVIDIA GeForce RTX 3090Ti GPU and 256 GB RAM. |
| Software Dependencies | No | The paper mentions using GloVe 300d for word initialization, an I3D network for visual features, and a BERT model [62] for feature extraction, but it does not specify version numbers for these software components or any other libraries/frameworks. |
| Experiment Setup | Yes | Each word in the query text is initialized using GloVe 300d, which remains frozen during training. Visual features of videos are extracted using a pre-trained I3D network. The maximum video feature sequence length is set to 128. Sequences longer than this are uniformly downsampled to 128, while shorter sequences are zero-padded to the same length. During training, since the training set was not reconstructed and only contains samples with grounding results, each batch includes randomly selected videos paired with the original query text to create samples without grounding results. In Equation 3, we empirically set β = 6 and γ = 6 to balance all loss functions at the start of training. For the threshold m in Equation 4, the value providing the highest accuracy on the validation set is chosen: 0.5 for Charades-RF and 0.3 for ActivityNet-RF. (A sketch of the sequence-length normalization follows the table.) |
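
The Dataset Splits row describes pairing each validation/test query with a randomly selected non-matching video to create samples without grounding results, and the Experiment Setup row applies the same idea on the fly within training batches. A minimal sketch of that construction, assuming samples are `(video_id, query, moment)` triples; the function name and the simple random pairing are our illustration, not the released RaTSG code, which may additionally filter random pairs that happen to contain a matching moment:

```python
import random

def add_no_grounding_samples(samples, seed=0):
    """For each (video_id, query, moment) sample, append a counterpart whose
    query is paired with a different, randomly chosen video, so the query has
    no grounding result. The output has a 1:1 ratio of samples with and
    without grounding results, as in the reconstructed Charades-RF and
    ActivityNet-RF splits. Assumes at least two distinct video ids."""
    rng = random.Random(seed)
    video_ids = [vid for vid, _, _ in samples]
    out = [(vid, query, moment, True) for vid, query, moment in samples]
    for vid, query, _ in samples:
        candidates = [v for v in video_ids if v != vid]
        out.append((rng.choice(candidates), query, None, False))
    return out
```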
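
The Experiment Setup row fixes the video feature sequence length at 128, uniformly downsampling longer sequences and zero-padding shorter ones. A hedged sketch of that normalization, assuming I3D features arrive as a `(T, D)` NumPy array; the function name and index-selection strategy are assumptions, not taken from the paper:

```python
import numpy as np

MAX_LEN = 128  # maximum video feature sequence length used in the paper

def normalize_length(features: np.ndarray, max_len: int = MAX_LEN) -> np.ndarray:
    """Uniformly downsample or zero-pad a (T, D) feature sequence to max_len."""
    t, d = features.shape
    if t > max_len:
        # Uniform downsampling: keep max_len evenly spaced frames.
        idx = np.linspace(0, t - 1, num=max_len).round().astype(int)
        return features[idx]
    if t < max_len:
        # Zero-pad at the end so all sequences share the same length.
        pad = np.zeros((max_len - t, d), dtype=features.dtype)
        return np.concatenate([features, pad], axis=0)
    return features
```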