Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video

Authors: Zenan Xu, Xiaojun Meng, Yasheng Wang, Qinliang Su, Zexuan Qiu, Xin Jiang, Qun Liu

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, with the advantages of summary-worthy visual information, our model can achieve a significant improvement on small datasets or even datasets with limited training data.
Researcher Affiliation | Collaboration | Zenan Xu1, Xiaojun Meng2, Yasheng Wang2, Qinliang Su1,4, Zexuan Qiu3, Xin Jiang2 and Qun Liu2; 1School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; 2Noah's Ark Lab, Huawei Technologies; 3The Chinese University of Hong Kong, Hong Kong SAR; 4Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China; {xuzn@mail2, suqliang@mail}.sysu.edu.cn, qzexuan@link.cuhk.edu.hk, {xiaojun.meng, wangyasheng, Jiang.Xin, qun.liu}@huawei.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We evaluate the proposed SWVR on three public datasets, including the How2, How2-300 [Sanabria et al., 2018], and MM-AVS [Fu et al., 2021] datasets. The statistics of the datasets are shown in Table 1.
Dataset Splits | Yes | The statistics of the datasets are shown in Table 1. (Table 1 provides Train/Dev/Test splits for the How2, How2-300, and MM-AVS datasets with specific counts, e.g., How2: 68336 Train, 2520 Dev, 2127 Test.)
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions the 'BART-base model' as the backbone and 'Adam' as the optimizer, but it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages.
Experiment Setup | Yes | The BART-base model is adopted as the backbone of our model, in which L = 6 for both the encoder and decoder. For the introduced auxiliary visual encoder, we use a 6-layer encoder with 8 attention heads and a 768 feed-forward dimension. Following previous work [Yu et al., 2021a], we set the max length of the generated summary to be 64 tokens; the decoding process can be stopped early if an End-of-Sequence (EOS) token is emitted. Adam [Kingma and Ba, 2014] with β1 = 0.9, β2 = 0.999, and a weight decay of 1e-5 is employed as the optimizer.
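
Since the paper releases no code, the sketch below is a minimal, hypothetical reconstruction of the reported setup using Hugging Face Transformers and PyTorch; the model checkpoint, the use of nn.TransformerEncoder for the auxiliary visual encoder, and the example input are all assumptions, with only the numeric hyperparameters taken from the paper.

```python
# Minimal sketch of the reported setup (assumed PyTorch / Hugging Face stack;
# the authors' actual implementation is not released).
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, BartTokenizer

# Backbone: BART-base, whose encoder and decoder each have L = 6 layers.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Auxiliary visual encoder: 6 Transformer layers, 8 attention heads,
# 768 feed-forward dimension (d_model = 768 is assumed to match BART-base).
visual_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=8, dim_feedforward=768, batch_first=True
)
visual_encoder = nn.TransformerEncoder(visual_layer, num_layers=6)

# Optimizer: Adam with beta1 = 0.9, beta2 = 0.999 and weight decay 1e-5.
params = list(bart.parameters()) + list(visual_encoder.parameters())
optimizer = torch.optim.Adam(params, betas=(0.9, 0.999), weight_decay=1e-5)

# Decoding: summaries capped at 64 tokens, stopping early when EOS is emitted.
inputs = tokenizer("an example transcript to summarize", return_tensors="pt")
summary_ids = bart.generate(
    inputs["input_ids"],
    max_length=64,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```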