UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation

Authors: Zhengkun Zhang, Xiaojun Meng, Yasheng Wang, Xin Jiang, Qun Liu, Zhenglu Yang (pp. 11757-11764)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly involved extractive objective as well as the knowledge distillation technique are proven to bring a noticeable improvement to the multimodal summarization task.
Researcher Affiliation | Collaboration | Zhengkun Zhang1*, Xiaojun Meng2, Yasheng Wang2, Xin Jiang2, Qun Liu2, Zhenglu Yang1 (1TKLNDST, CS, Nankai University, China; 2Noah's Ark Lab, Huawei Technologies) zhangzk2017@mail.nankai.edu.cn, {xiaojun.meng, wangyasheng, Jiang.Xin, qun.liu}@huawei.com, yangzl@nankai.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology.
Open Datasets | Yes | We use the MSMO dataset, which is collected by (Zhu et al. 2018) for multimodal summarization.
Dataset Splits | Yes | The dataset includes 293,965 training pairs, 10,355 validation pairs, and 10,261 test pairs.
Hardware Specification | No | The paper does not specify any hardware details such as GPU or CPU models used for running the experiments.
Software Dependencies | Yes | Our framework is built on the bart-base version of BART (Lewis et al. 2020) with its initialized parameters and tokenizer. We use the released ViT-B-32 version of CLIP (Radford et al. 2021) as the teacher network for knowledge distillation. (See the loading sketch after the table.)
Experiment Setup | Yes | All models are trained for 30,000 steps with 750 steps for warm-up. Model checkpoints are saved and evaluated on the validation set every 2,000 steps. For abstractive summarization, we use beam search (size 5) in decoding, and tune the α for the length penalty (Wu et al. 2016) between 1.6 and 2.0 on the validation set; we decode until an end-of-sequence token is emitted. (See the training and decoding sketch after the table.)
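The Software Dependencies row names bart-base and CLIP ViT-B-32 as the two pretrained backbones. The sketch below is not the authors' released code (none is available); it only illustrates how those checkpoints could be loaded with the Hugging Face transformers library. The model identifiers "facebook/bart-base" and "openai/clip-vit-base-patch32" are assumptions about which public checkpoints the paper's footnotes refer to.

```python
# Minimal loading sketch (assumption: the paper's footnotes point to the
# public "facebook/bart-base" and "openai/clip-vit-base-patch32" checkpoints).
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          CLIPModel, CLIPProcessor)

# BART encoder-decoder that the framework is built on, with its pretrained
# parameters and tokenizer.
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# CLIP ViT-B/32 used as the teacher network for knowledge distillation.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()  # the teacher provides targets and is not updated here
```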
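The Experiment Setup row reports 30,000 training steps with 750 warm-up steps, checkpointing every 2,000 steps, beam search of size 5, and a length penalty α tuned between 1.6 and 2.0. The sketch below wires those reported numbers into a standard transformers training and decoding setup; the linear warm-up shape, the AdamW learning rate, and the concrete α value of 1.8 are illustrative assumptions, since the quoted text does not specify them.

```python
# Hyperparameter sketch under stated assumptions (linear warm-up, AdamW,
# placeholder learning rate); only the step counts, beam size, and the
# length-penalty range come from the paper.
import torch
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          get_linear_schedule_with_warmup)

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # lr is a placeholder
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=750,       # warm-up steps reported in the paper
    num_training_steps=30_000,  # total training steps reported in the paper
)
# (Training loop omitted; checkpoints would be saved and evaluated on the
# validation set every 2,000 steps, as the paper states.)

# Abstractive decoding: beam search of size 5 with a tuned length penalty.
article = "Some input article text ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(
    **inputs,
    num_beams=5,
    length_penalty=1.8,   # α tuned between 1.6 and 2.0 on the validation set
    early_stopping=True,  # stop once all beams emit the end-of-sequence token
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Setting early_stopping=True mirrors the paper's statement that decoding continues until an end-of-sequence token is emitted.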