UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation
Authors: Zhengkun Zhang, Xiaojun Meng, Yasheng Wang, Xin Jiang, Qun Liu, Zhenglu Yang
AAAI 2022, pp. 11757-11764 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly involved extractive objective as well as the knowledge distillation technique are proven to bring a noticeable improvement to the multimodal summarization task. |
| Researcher Affiliation | Collaboration | Zhengkun Zhang1*, Xiaojun Meng2, Yasheng Wang2, Xin Jiang2, Qun Liu2, Zhenglu Yang1 1TKLNDST, CS, Nankai University, China, 2 Noah's Ark Lab, Huawei Technologies zhangzk2017@mail.nankai.edu.cn, {xiaojun.meng, wangyasheng, Jiang.Xin, qun.liu}@huawei.com, yangzl@nankai.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We use the MSMO dataset, which is collected by (Zhu et al. 2018) for multimodal summarization. |
| Dataset Splits | Yes | The dataset includes 293,965 training pairs, 10,355 validation pairs, and 10,261 test pairs. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | Yes | Our framework is built on the bart-base version of BART (Lewis et al. 2020) with its initialized parameters and tokenizer. We use the released ViT-B-32 version of CLIP (Radford et al. 2021) as the teacher network for knowledge distillation. (A checkpoint-loading sketch follows the table.) |
| Experiment Setup | Yes | All models are trained for 30,000 steps with 750 steps for warm-up. Model checkpoints are saved and evaluated on the validation set every 2,000 steps. For abstractive summarization, we use beam search (size 5) in decoding, and tune the α for the length penalty (Wu et al. 2016) between 1.6 and 2.0 on the validation set; we decode until an end-of-sequence token is emitted. (A decoding sketch follows the table.) |
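
The Software Dependencies row names only the checkpoints (bart-base for the backbone, CLIP ViT-B-32 as the distillation teacher), not the loading code. Below is a minimal sketch assuming the Hugging Face Transformers library and the `facebook/bart-base` and `openai/clip-vit-base-patch32` model identifiers; the paper does not confirm this library or these identifiers, and the sketch is not the authors' implementation.

```python
# Sketch (assumption: Hugging Face Transformers) of loading the stated checkpoints.
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    CLIPModel,
    CLIPProcessor,
)

# BART backbone initialized from the released bart-base parameters and tokenizer.
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# CLIP ViT-B-32 used as a frozen teacher network for knowledge distillation.
clip_teacher = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_teacher.eval()
for param in clip_teacher.parameters():
    param.requires_grad = False  # teacher weights stay fixed during distillation
```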
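The Experiment Setup row quotes the decoding configuration: beam search of size 5, a length penalty α tuned between 1.6 and 2.0 on the validation set, and decoding until an end-of-sequence token is emitted. The sketch below maps those settings onto the Hugging Face `generate()` API as an illustration; the model and tokenizer names are assumptions, and α is fixed at 1.8 only as a placeholder within the tuned range.

```python
# Sketch (assumption: Hugging Face generate() API) of the quoted decoding settings.
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

inputs = tokenizer(
    "Article text to summarize ...",
    return_tensors="pt",
    truncation=True,
    max_length=1024,
)
summary_ids = model.generate(
    **inputs,
    num_beams=5,          # beam search with size 5
    length_penalty=1.8,   # placeholder for the α tuned in [1.6, 2.0] on validation
    early_stopping=True,  # stop once the end-of-sequence token is emitted
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```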