Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
Authors: Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, Xiaoying Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are available at https://github.com/gyxxyg/TRACE. We conduct comprehensive experiments on multiple VTG tasks and datasets to verify the effectiveness of TRACE. The results reveal significant improvements of TRACE in comparison to SOTA video LLMs. Notably, TRACE improves zero-shot performance by 3.1% and 4.9% on Youcook2 (CIDEr and F1 Score), by 6.5% and 3.7% in Recall (IOU = {0.5, 0.7}) on Charades-STA, and by 10.3% and 9.2% for mAP and HIT@1 on QVHighlights. |
| Researcher Affiliation | Collaboration | Yongxin Guo1, Jingyu Liu2, Mingda Li2, Qingbin Liu2, Xi Chen2, Xiaoying Tang1,3,4. 1School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China; 2Tencent PCG; 3Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS), Shenzhen, China; 4Guangdong Provincial Key Laboratory of Future Networks of Intelligence, Shenzhen, China |
| Pseudocode | No | The paper describes the TRACE model and its components using textual descriptions and figures (e.g., Figure 2 for overview, Figure 3 for token sequence, Figure 4 for generation process) rather than structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our model and code are available at https://github.com/gyxxyg/TRACE. |
| Open Datasets | Yes | We evaluate the model performance on three different tasks: Dense video caption. We use Youcook2 (Zhou et al., 2018) and ActivityNet Captions (Fabian Caba Heilbron & Niebles, 2015) datasets as the evaluation datasets. Moment retrieval. We utilize the test set of Charades-STA (Gao et al., 2017) for the moment retrieval task. Video highlight detection. We employ the validation set of the QVHighlights dataset (Lei et al., 2021). ...Stage 1 primarily utilizes two groups of datasets. Image and video caption datasets for initializing the visual compression layer. This group of datasets includes Valley (Luo et al., 2023b), LLaVA Image (Liu et al., 2024), TextVR (Wu et al., 2025), and a randomly sampled subset of ShareGPT4Video (Chen et al., 2024a) datasets. VTG datasets for task encoder/head initialization. We use the VTG-IT dataset in this group. ...Stage 2 primarily utilizes three groups of datasets. VTG instruction tuning datasets for enhancing VTG capacity. We use VTG-IT (Guo et al., 2024), ActivityNet Captions (Fabian Caba Heilbron & Niebles, 2015), and a subset of InternVid (Wang et al., 2023b). ...Video question answering datasets to enhance TRACE's reasoning capabilities. We use VideoChatGPT (Maaz et al., 2023) and Next-QA (Xiao et al., 2021) in this part. |
| Dataset Splits | Yes | We utilize the test set of Charades-STA (Gao et al., 2017) for the moment retrieval task and report the recall at IOU thresholds of 0.5 and 0.7. Additionally, we present the mIOU results. We employ the validation set of the QVHighlights dataset (Lei et al., 2021) and report the mean average precision (mAP) with IOU thresholds of 0.5 and 0.75, as well as the HIT@1, which represents the hit ratio of the highest-scored clip. For each video, the content is uniformly divided into 128 clips, with one frame randomly sampled from each clip. |
| Hardware Specification | Yes | Table 6: Detailed training setting and hyper-parameters. Setting: Stage 1 / Stage 2. Computation: 16 ATN 910B / 16 ATN 910B. |
| Software Dependencies | Yes | Table 6: Detailed training setting and hyper-parameters. Setting: Stage 1 / Stage 2. LLM: Mistral-7B-v0.2 / Mistral-7B-v0.2. Vision Encoder: openai/clip-vit-large-patch14-336 / openai/clip-vit-large-patch14-336. |
| Experiment Setup | Yes | Table 6: Detailed training setting and hyper-parameters. Setting: Stage 1 / Stage 2. Batch Size: 128 / 128. Num Frames: 128 / 128. Train Epochs: 1 / 2. Learning Rate: 1e-3 / 5e-6. LR Scheduler: Cosine / Cosine. Model Max Length: 4096 / 4096. |
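The frame-sampling step quoted under Dataset Splits (uniformly divide each video into 128 clips, then randomly sample one frame per clip) can be sketched as below. This is an illustrative reconstruction, not the paper's actual preprocessing code; the function name and signature are assumptions.

```python
import random

def sample_frame_indices(num_video_frames: int, num_clips: int = 128) -> list[int]:
    """Uniformly divide a video into `num_clips` clips and randomly
    pick one frame index from each clip (sketch of the evaluation
    preprocessing described in the paper; names are illustrative)."""
    # Clip boundaries, spaced as evenly as integer frame counts allow.
    bounds = [round(i * num_video_frames / num_clips) for i in range(num_clips + 1)]
    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Guard against empty clips for very short videos.
        indices.append(random.randrange(start, max(end, start + 1)))
    return indices
```

For a typical long video (e.g. 3600 frames), this yields 128 strictly increasing frame indices, one per clip.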