Cross-Category Highlight Detection via Feature Decomposition and Modality Alignment
Authors: Zhenduo Zhang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the extensive experimental results on three challenging public benchmarks validate the efficacy of our paradigm and the superiority over the existing state-of-the-art approaches to video highlight detection. We conduct extensive experiments on popular video highlight benchmarks to validate the effectiveness and superiority of our paradigm. Experiment Settings and Compared Methods: We evaluate our approaches on three popular benchmark datasets. |
| Researcher Affiliation | Industry | Zhenduo Zhang, Platform Technology Department, OVBU, PCG, Tencent, China (ericzdzhang@163.com) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The source code will be released. |
| Open Datasets | Yes | We evaluate our approaches on three popular benchmark datasets, i.e., YouTube Highlights (Sun et al. 2016), TVSum (Song et al. 2015) and CoSum (Chu, Song, and Jaimes 2015), for video highlight detection. We use the ViT-32 (Dosovitskiy et al. 2020) pretrained on Kinetics-400 (Carreira and Zisserman 2017) as the video encoder to extract visual features of the sampled frames and use the PANN (Kong et al. 2020) pretrained on AudioSet (Gemmeke et al. 2017) to extract the audio embeddings of audio clips. |
| Dataset Splits | No | The paper mentions evaluating on benchmark datasets but does not provide specific details on training, validation, and test dataset splits (e.g., percentages, counts, or explicit standard splits for their experiments). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using pre-trained models (ViT-32, PANN) and an optimizer (Adam), but it does not specify software dependencies with version numbers (e.g., programming language, libraries, frameworks). |
| Experiment Setup | Yes | We sample 16 frames uniformly from the video frames of a video. We train our model using Adam, with a learning rate of 1 × 10⁻⁴. The weights of the losses in Equation 9 are {λ₁, λ₂, λ₃} = {1.0, 0.5, 1.0}, which are selected by a grid search strategy. The size of the highlight feature set fed into the self-attention layer in the MFDB module, which is the N_G mentioned above, is set to 16. |
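
The feature-extraction setup quoted in the Open Datasets row (16 uniformly sampled frames encoded with a pretrained ViT, audio clips encoded with a pretrained PANN) can be summarized in a minimal PyTorch sketch. The `visual_encoder` and `audio_encoder` arguments are hypothetical stand-ins for the pretrained models named in the paper; this is an illustration of the described pipeline, not the authors' released code.

```python
import torch

def uniform_sample_indices(num_frames: int, num_samples: int = 16) -> torch.Tensor:
    """Indices of `num_samples` frames spread uniformly across the video."""
    return torch.linspace(0, num_frames - 1, num_samples).long()

def extract_features(frames, audio_clips, visual_encoder, audio_encoder):
    """Extract per-modality embeddings with frozen pretrained encoders.

    frames:      (T, C, H, W) tensor of decoded video frames
    audio_clips: tensor of audio clips aligned to the video
    visual_encoder / audio_encoder: hypothetical stand-ins for the pretrained
    ViT and PANN models named in the paper, used as fixed feature extractors.
    """
    idx = uniform_sample_indices(frames.shape[0], num_samples=16)
    with torch.no_grad():                           # encoders are not fine-tuned here
        visual_feats = visual_encoder(frames[idx])  # (16, D_v) visual embeddings
        audio_feats = audio_encoder(audio_clips)    # (N_a, D_a) audio embeddings
    return visual_feats, audio_feats
```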
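Similarly, the hyperparameters quoted in the Experiment Setup row translate into the following training configuration: Adam with learning rate 1 × 10⁻⁴ and a three-term loss weighted by {1.0, 0.5, 1.0}. The individual loss terms of Equation 9 are represented by placeholder arguments; this is a sketch of the reported settings, not the authors' implementation.

```python
import torch

LOSS_WEIGHTS = (1.0, 0.5, 1.0)  # {λ1, λ2, λ3} for the three terms of Equation 9
N_G = 16                        # size of the highlight feature set in the MFDB module

def total_loss(loss_1: torch.Tensor, loss_2: torch.Tensor, loss_3: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the three (placeholder) loss terms from Equation 9."""
    l1, l2, l3 = LOSS_WEIGHTS
    return l1 * loss_1 + l2 * loss_2 + l3 * loss_3

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Adam optimizer with the reported learning rate of 1e-4."""
    return torch.optim.Adam(model.parameters(), lr=1e-4)
```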