Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Authors: Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks. We evaluate our V2Xum-LLaMA model, both 7B and 13B versions, against various models on V2V, V2T, and V2VT summarization tasks using the VideoXum dataset. Baseline models include LLM-based approaches such as Frozen BLIP (Li et al. 2023), VSUM-BLIP (Lin et al. 2023a), TSUM-BLIP (Lin et al. 2023a), and VTSUM-BLIP (Lin et al. 2023a). We compare with task-specific-head-free (TSH-Free) models like DENSE (Krishna et al. 2017), DVC-D-A (Li et al. 2018), Bi-LSTM+TempoAttn (Zhou et al. 2018b), Masked Transformer (Zhou et al. 2018b), and Support-Set (Patrick et al. 2020).
Researcher Affiliation Academia Hang Hua*, Yunlong Tang*, Chenliang Xu, Jiebo Luo University of Rochester EMAIL, EMAIL
Pseudocode No The paper describes the methodology in prose and includes architectural diagrams, but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code No Other implementation details are included in our technical appendices (Hua et al. 2024b).
Open Datasets Yes Additionally, on the classical TVSum (Song et al. 2015) and SumMe (Gygli et al. 2014) datasets, we compare our 7B version V2Xum-LLaMA with the following V2V summarization methods... For cross-modal video summarization, we adopt the VideoXum dataset (Lin et al. 2023a) and our proposed V2Xum dataset. For V2V summarization, we used the TVSum (Song et al. 2015) and SumMe (Gygli et al. 2014) benchmarks.
Dataset Splits Yes To address these issues, we propose Instruct-V2Xum, a new large-scale cross-modal video summarization dataset that contains 30k open-domain videos, partitioned as 25,000 in the training set, 1,000 in the validation set, and 4,000 in the test set.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions the models used for the vision encoder and text decoder.
Software Dependencies No The paper mentions using specific models like "CLIP ViT-L/14@336 as the vision encoder and Vicuna-v1.5-7B/13B as the text decoder" but does not specify software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x, CUDA 11.x).
Experiment Setup No During training, all the parameters of the vision encoder and the language decoder are frozen, and only the vision adapter is updated. We then train the model end-to-end using the negative log-likelihood loss: $-\sum_{i=1}^{D} \log p(x^A_i \mid S, x^A_{1:i-1})$. The paper describes some high-level training strategies (like freezing parameters and the loss function) but does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings in the main text.
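The quoted objective is a standard autoregressive negative log-likelihood over target tokens conditioned on the prompt. A minimal sketch in plain Python (the per-token probabilities here are hypothetical illustration values, not from the paper's model):

```python
import math

def nll_loss(token_probs):
    """Autoregressive NLL: -sum_i log p(x_i | S, x_{1:i-1}).

    token_probs: the conditional probability the model assigned to each
    target token, already conditioned on the prompt S and the preceding
    target tokens. In practice these come from a decoder's softmax
    outputs; here they are assumed values for illustration.
    """
    return -sum(math.log(p) for p in token_probs)

# Example: a four-token target with assumed conditional probabilities.
# Higher-probability tokens contribute less loss.
loss = nll_loss([0.9, 0.7, 0.8, 0.6])
```

Minimizing this quantity over the trainable parameters (in the paper's setup, only the vision adapter) is equivalent to maximizing the likelihood of the reference summary tokens.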