Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Authors: Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks. We evaluate our V2Xum-LLaMA model, both 7B and 13B versions, against various models on V2V, V2T, and V2VT summarization tasks using the VideoXum dataset. Baseline models include LLM-based approaches such as Frozen BLIP (Li et al. 2023), VSUM-BLIP (Lin et al. 2023a), TSUM-BLIP (Lin et al. 2023a), and VTSUM-BLIP (Lin et al. 2023a). We compare with task-specific-head-free (TSH-Free) models like DENSE (Krishna et al. 2017), DVC-D-A (Li et al. 2018), Bi-LSTM+TempoAttn (Zhou et al. 2018b), Masked Transformer (Zhou et al. 2018b), and Support-Set (Patrick et al. 2020). |
| Researcher Affiliation | Academia | Hang Hua*, Yunlong Tang*, Chenliang Xu, Jiebo Luo University of Rochester EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology in prose and includes architectural diagrams, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Other implementation details are included in our technical appendices (Hua et al. 2024b). |
| Open Datasets | Yes | Additionally, on the classical TVSum (Song et al. 2015) and SumMe (Gygli et al. 2014) datasets, we compare our 7B version V2Xum-LLaMA with the following V2V summarization methods... For cross-modal video summarization, we adopt the VideoXum dataset (Lin et al. 2023a) and our proposed V2Xum dataset. For V2V summarization, we used the TVSum (Song et al. 2015) and SumMe (Gygli et al. 2014) benchmarks. |
| Dataset Splits | Yes | To address these issues, we propose Instruct-V2Xum, a new large-scale cross-modal video summarization dataset that contains 30k open-domain videos, partitioned as 25,000 in the training set, 1,000 in the validation set, and 4,000 in the test set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions the models used for the vision encoder and text decoder. |
| Software Dependencies | No | The paper mentions using specific models like "CLIP ViT-L/14@336 as the vision encoder and Vicuna-v1.5-7B/13B as the text decoder" but does not specify software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x, CUDA 11.x). |
| Experiment Setup | No | During training, we freeze all the parameters of the vision encoder and the language decoder, and update the vision adapter. We then train the model end-to-end using the negative log-likelihood loss: \( \mathcal{L} = -\sum_{i=1}^{D} \log p(A^x_i \mid S, A^x_{1:i-1}) \). The paper describes some high-level training strategies (like freezing parameters and the loss function) but does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings in the main text. |
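The negative log-likelihood objective quoted in the Experiment Setup row can be sketched as a short computation. This is a minimal illustrative sketch, not the authors' code: `nll_loss` and its input format are hypothetical, standing in for the per-token probabilities \( p(A^x_i \mid S, A^x_{1:i-1}) \) that an autoregressive decoder would emit for the ground-truth summary tokens.

```python
import math

def nll_loss(token_probs):
    """Sum of negative log-probabilities over a predicted token sequence.

    token_probs: the model's probability for each ground-truth summary
    token, conditioned on the source tokens S and all previous answer
    tokens (a hypothetical input format for illustration only).
    """
    return -sum(math.log(p) for p in token_probs)

# Toy example: a 3-token summary predicted with probabilities 0.9, 0.5, 0.8.
loss = nll_loss([0.9, 0.5, 0.8])
```

A perfectly confident model (every probability 1.0) yields a loss of 0; lower probabilities on the ground-truth tokens drive the loss up, which is what end-to-end training minimizes.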