TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection

Authors: Hao Sun, Mingyao Zhou, Wenjing Chen, Wei Xie

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods.
Researcher Affiliation | Academia | Hao Sun 1,2,3*, Mingyao Zhou 1,2,3*, Wenjing Chen 4, Wei Xie 1,2,3. 1 Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; 2 School of Computer Science, Central China Normal University, Wuhan, China; 3 National Language Resources Monitoring and Research Center for Network Media, Central China Normal University, Wuhan, China; 4 School of Computer Science, Hubei University of Technology, Wuhan, China. haosun@ccnu.edu.cn, zhoumingyao@mails.ccnu.edu.cn, chenwenjing@hbut.edu.cn, XW@mail.ccnu.edu.cn
Pseudocode | No | The paper includes a figure illustrating the proposed model architecture and mathematical equations, but it does not contain pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | Codes are available at https://github.com/mingyao1120/TR-DETR.
Open Datasets | Yes | The QVHighlights dataset (Lei, Berg, and Bansal 2021) comprises 10,148 content-rich videos from YouTube. The Charades-STA dataset (Gao et al. 2017) contains 9,848 videos capturing daily indoor activities and 16,128 human-tagged query texts. The TVSum dataset (Song et al. 2015) is a benchmark dataset for HD.
Dataset Splits | Yes | For QVHighlights, results are reported on the QVHighlights val set (per the Figure 2 caption and the Table 4 ablation caption, which compares settings (a) through (i) on that split). For Charades-STA, we allocate 12,408 samples for training while the remaining 3,720 samples are for testing. For TVSum, 80% of the dataset is utilized for training and the remaining for testing. (A partitioning sketch is given after this table.)
Hardware Specification | Yes | Moreover, all our experiments are conducted on an Nvidia RTX 4090 and a 12th Gen Intel(R) Core(TM) i7-12700 CPU.
Software Dependencies | No | The paper mentions several pre-trained networks and feature extractors such as CLIP, SlowFast, and PANN, but it does not specify their version numbers or other general software dependencies such as programming languages or deep learning frameworks with their versions.
Experiment Setup | Yes | The hidden layer dimension d is 256, and λ_lg is set to 0.3. For QVHighlights, the training phase involves 200 epochs, a batch size of 32, and a learning rate of 1e-4. For TVSum, training spans 2000 epochs with a batch size of 4 and a learning rate of 1e-3. For Charades-STA, the training phase includes 100 epochs, a batch size of 8, and a learning rate of 1e-4. (These settings are collected in the configuration sketch after this table.)
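
The per-dataset training settings quoted under Experiment Setup can be gathered in one place. Below is a minimal sketch assuming a plain Python dictionary layout; the key names and the build_config helper are illustrative and are not taken from the official TR-DETR repository.

```python
# Training hyperparameters reported for TR-DETR. The dictionary layout, key names,
# and build_config helper are illustrative assumptions, not the configuration
# format used by the official repository.
COMMON = {
    "hidden_dim": 256,   # hidden layer dimension d
    "lambda_lg": 0.3,    # weight of the local-global alignment loss
}

PER_DATASET = {
    "qvhighlights": {"epochs": 200,  "batch_size": 32, "learning_rate": 1e-4},
    "tvsum":        {"epochs": 2000, "batch_size": 4,  "learning_rate": 1e-3},
    "charades_sta": {"epochs": 100,  "batch_size": 8,  "learning_rate": 1e-4},
}

def build_config(dataset: str) -> dict:
    """Merge the shared model settings with the dataset-specific training schedule."""
    return {**COMMON, **PER_DATASET[dataset]}

print(build_config("qvhighlights"))
# {'hidden_dim': 256, 'lambda_lg': 0.3, 'epochs': 200, 'batch_size': 32, 'learning_rate': 0.0001}
```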
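
Similarly, the Charades-STA and TVSum partitions described under Dataset Splits can be sketched as follows. The split sizes come from the report; the annotation file names, JSON format, index-based slicing, and random seed are assumptions made only for illustration (Charades-STA ships with a predefined train/test split in practice).

```python
import json
import random

# Hypothetical annotation files; the real file names and formats depend on the TR-DETR repo.
CHARADES_ANNOTATIONS = "charades_sta_annotations.json"
TVSUM_ANNOTATIONS = "tvsum_annotations.json"

def split_charades_sta(path: str = CHARADES_ANNOTATIONS):
    """Charades-STA: 12,408 samples for training, the remaining 3,720 for testing."""
    with open(path) as f:
        samples = json.load(f)
    assert len(samples) == 12408 + 3720, "expected 16,128 query-moment pairs"
    # Slicing by index is a simplification; the official release defines the split explicitly.
    return samples[:12408], samples[12408:]

def split_tvsum(path: str = TVSUM_ANNOTATIONS, train_ratio: float = 0.8, seed: int = 0):
    """TVSum: 80% of the videos for training, the remaining 20% for testing."""
    with open(path) as f:
        videos = json.load(f)
    rng = random.Random(seed)  # the paper does not specify a seed; 0 is an arbitrary choice
    rng.shuffle(videos)
    cut = int(len(videos) * train_ratio)
    return videos[:cut], videos[cut:]
```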