Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video
Authors: Zenan Xu, Xiaojun Meng, Yasheng Wang, Qinliang Su, Zexuan Qiu, Xin Jiang, Qun Liu
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, with the advantages of summary-worthy visual information, our model achieves significant improvements on small datasets or even datasets with limited training data. |
| Researcher Affiliation | Collaboration | Zenan Xu (1), Xiaojun Meng (2), Yasheng Wang (2), Qinliang Su (1,4), Zexuan Qiu (3), Xin Jiang (2), and Qun Liu (2). (1) School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; (2) Noah's Ark Lab, Huawei Technologies; (3) The Chinese University of Hong Kong, Hong Kong SAR; (4) Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China. {xuzn@mail2, suqliang@mail}.sysu.edu.cn, qzexuan@link.cuhk.edu.hk, {xiaojun.meng, wangyasheng, Jiang.Xin, qun.liu}@huawei.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We evaluate the proposed model on three public datasets, including the How2, How2-300 [Sanabria et al., 2018], and MM-AVS [Fu et al., 2021] datasets. The statistics of the datasets are shown in Table 1. |
| Dataset Splits | Yes | The statistics of the datasets are shown in Table 1. (Table 1 provides 'Train Dev Test' splits for the How2, How2-300, and MM-AVS datasets with specific counts, e.g., How2: 68336 Train, 2520 Dev, 2127 Test). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions 'BART-base model' as the backbone and 'Adam' as the optimizer, but it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | The BART-base model is adopted as the backbone of our model, in which L = 6 for both the encoder and decoder. For the introduced auxiliary visual encoder, we use a 6-layer encoder with 8 attention heads and a 768 feed-forward dimension. Following previous work [Yu et al., 2021a], we set the max length of the generated summary to be 64 tokens; the decoding process can be stopped early if an End-of-Sequence (EOS) token is emitted. The Adam optimizer [Kingma and Ba, 2014] with β1 = 0.9, β2 = 0.999, and a weight decay of 1e-5 is employed. |
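For reference, a minimal sketch of the reported setup, assuming Hugging Face `transformers` and PyTorch (neither toolkit is named in the paper). The auxiliary visual encoder below is a plain `TransformerEncoder` stand-in, the learning rate is a placeholder the paper does not report, and the fusion between the visual encoder and BART is not reproduced here.

```python
# Sketch only, not the authors' code: instantiating the reported configuration.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# Backbone: BART-base (6 encoder layers, 6 decoder layers, hidden size 768).
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Auxiliary visual encoder: 6 layers, 8 attention heads, feed-forward dim 768.
visual_layer = torch.nn.TransformerEncoderLayer(
    d_model=768, nhead=8, dim_feedforward=768, batch_first=True
)
visual_encoder = torch.nn.TransformerEncoder(visual_layer, num_layers=6)

# Optimizer: Adam with beta1 = 0.9, beta2 = 0.999, weight decay 1e-5.
# The learning rate is not reported in the paper; 3e-5 is only a placeholder.
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(visual_encoder.parameters()),
    lr=3e-5, betas=(0.9, 0.999), weight_decay=1e-5,
)

# Decoding: summaries capped at 64 tokens, stopping early at the EOS token.
inputs = tokenizer("an example transcript to summarize", return_tensors="pt")
summary_ids = model.generate(
    inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```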