Cross-Category Highlight Detection via Feature Decomposition and Modality Alignment
Authors: Zhenduo Zhang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the extensive experimental results on three challenging public benchmarks validate the efficacy of our paradigm and the superiority over the existing state-of-the-art approaches to video highlight detection. We conduct extensive experiments on popular video highlight benchmarks to validate the effectiveness and superiority of our paradigm. Experiment Settings and Compared Methods: We evaluate our approaches on three popular benchmark datasets. |
| Researcher Affiliation | Industry | Zhenduo Zhang, Platform Technology Department, OVBU, PCG, Tencent, China (ericzdzhang@163.com) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The source code will be released. |
| Open Datasets | Yes | We evaluate our approaches on three popular benchmark datasets, i.e., YouTube Highlights (Sun et al. 2016), TVSum (Song et al. 2015) and CoSum (Chu, Song, and Jaimes 2015), for video highlight detection. We use the ViT-32 (Dosovitskiy et al. 2020) pretrained on Kinetics-400 (Carreira and Zisserman 2017) as the video encoder to extract visual features of the sampled frames and use the PANN (Kong et al. 2020) pretrained on AudioSet (Gemmeke et al. 2017) to extract the audio embeddings of audio clips. |
| Dataset Splits | No | The paper mentions evaluating on benchmark datasets but does not provide specific details on training, validation, and test dataset splits (e.g., percentages, counts, or explicit standard splits for their experiments). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using pre-trained models (ViT-32, PANN) and an optimizer (Adam), but it does not specify software dependencies with version numbers (e.g., programming language, libraries, frameworks). |
| Experiment Setup | Yes | We sample 16 frames uniformly from the video frames of a video. We train our model using Adam, with a learning rate of 1 × 10⁻⁴. The weights of the losses in Equation 9 are {λ₁, λ₂, λ₃} = {1.0, 0.5, 1.0}, which are selected by a grid search strategy. The size of the highlight feature set fed into the self-attention layer in the MFDB module, which is the N_G mentioned above, is set to 16. |
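
The feature-extraction setup quoted in the Open Datasets row (16 uniformly sampled frames encoded with a pretrained ViT, audio clips encoded with a pretrained PANN) can be summarized in a minimal PyTorch sketch. The `visual_encoder` and `audio_encoder` arguments are hypothetical stand-ins for the pretrained models named in the paper; this is an illustration of the described pipeline, not the authors' released code.

```python
import torch

def uniform_sample_indices(num_frames: int, num_samples: int = 16) -> torch.Tensor:
    """Indices of `num_samples` frames spread uniformly across the video."""
    return torch.linspace(0, num_frames - 1, num_samples).long()

def extract_features(frames, audio_clips, visual_encoder, audio_encoder):
    """Extract per-modality embeddings with frozen pretrained encoders.

    frames:      (T, C, H, W) tensor of decoded video frames
    audio_clips: tensor of audio clips aligned to the video
    visual_encoder / audio_encoder: hypothetical stand-ins for the pretrained
    ViT and PANN models named in the paper, used as fixed feature extractors.
    """
    idx = uniform_sample_indices(frames.shape[0], num_samples=16)
    with torch.no_grad():                           # encoders are not fine-tuned here
        visual_feats = visual_encoder(frames[idx])  # (16, D_v) visual embeddings
        audio_feats = audio_encoder(audio_clips)    # (N_a, D_a) audio embeddings
    return visual_feats, audio_feats
```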
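Similarly, the hyperparameters quoted in the Experiment Setup row translate into the following training configuration: Adam with learning rate 1 × 10⁻⁴ and a three-term loss weighted by {1.0, 0.5, 1.0}. The individual loss terms of Equation 9 are represented by placeholder arguments; this is a sketch of the reported settings, not the authors' implementation.

```python
import torch

LOSS_WEIGHTS = (1.0, 0.5, 1.0)  # {λ1, λ2, λ3} for the three terms of Equation 9
N_G = 16                        # size of the highlight feature set in the MFDB module

def total_loss(loss_1: torch.Tensor, loss_2: torch.Tensor, loss_3: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the three (placeholder) loss terms from Equation 9."""
    l1, l2, l3 = LOSS_WEIGHTS
    return l1 * loss_1 + l2 * loss_2 + l3 * loss_3

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Adam optimizer with the reported learning rate of 1e-4."""
    return torch.optim.Adam(model.parameters(), lr=1e-4)
```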