DIUSum: Dynamic Image Utilization for Multimodal Summarization

Authors: Min Xiao, Junnan Zhu, Feifei Zhai, Yu Zhou, Chengqing Zong

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that DIUSum outperforms multiple strong baselines and achieves state-of-the-art performance on two public multimodal summarization datasets. |
| Researcher Affiliation | Collaboration | (1) State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; (3) Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd., Beijing, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology (no repository link, explicit code-release statement, or code in supplementary materials). |
| Open Datasets | Yes | "We experiment with the MMS (Li et al. 2018) and MSMO (Zhu et al. 2018) datasets." |
| Dataset Splits | Yes | MMS: 62,000 train / 2,000 dev / 2,000 test; MSMO: 293,964 train / 10,355 dev / 10,256 test (also encoded in the configuration sketch below the table). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions BERT and VGG-19 as models and BertAdam as an optimizer, but does not provide specific version numbers for software dependencies (e.g., Python 3.x, PyTorch 1.x) needed for replication; see the encoder-loading sketch after the table. |
| Experiment Setup | Yes | The batch size is set to 8. For the MMS dataset, the max text encoding length is 60 and the max text decoding length is 20; for the MSMO dataset, they are 300 and 120, respectively. The authors use the BertAdam (Kingma and Ba 2014) optimizer with a learning rate of 1e-4 and a warmup portion of 0.1. When calculating the edit distance of the first k tokens, k is set to 5 and 8 for MMS and MSMO, respectively. For MMS, the training epochs for the three stages are T1 = 15, T2 = 5, T3 = 10, and the learning weights of the image selector are α = 1, r_t = 0.1; for MSMO, T1 = 8, T2 = 2, T3 = 10, with α = 0.5, r_t = 0.05. The base model trains for 30 and 20 epochs on MMS and MSMO, respectively. In the test phase, beam search is employed with a beam size of 4. (These values are consolidated in the configuration sketch below.) |
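
Since the Software Dependencies row notes that BERT and VGG-19 are named without versions, here is a minimal loading sketch, assuming PyTorch with the transformers and torchvision libraries. The checkpoint names, frozen-extractor setup, and preprocessing are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of loading the two encoders the paper names (BERT, VGG-19).
# All checkpoint names and library choices below are assumptions; the paper
# does not specify versions.
import torch
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # checkpoint assumed
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# VGG-19 as a frozen image feature extractor (convolutional trunk only).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
image_encoder = vgg.features.eval()
for p in image_encoder.parameters():
    p.requires_grad = False

# Standard ImageNet preprocessing for VGG inputs.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```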
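
For reference, the dataset splits and hyperparameters quoted in the table can be collected into a single configuration sketch. The DIUSumConfig class and its field names are illustrative inventions; only the numeric values come from the paper.

```python
# Consolidated view of the splits and hyperparameters quoted in the table.
# Field names are illustrative (not the authors' code); only the values
# come from the paper.
from dataclasses import dataclass

@dataclass
class DIUSumConfig:
    splits: tuple          # (train, dev, test) sizes
    max_src_len: int       # max text encoding length
    max_tgt_len: int       # max text decoding length
    k: int                 # prefix length for the edit-distance calculation
    stage_epochs: tuple    # (T1, T2, T3)
    alpha: float           # image-selector learning weight α
    r_t: float             # image-selector learning weight r_t
    base_model_epochs: int
    batch_size: int = 8
    learning_rate: float = 1e-4
    warmup_portion: float = 0.1
    beam_size: int = 4

MMS = DIUSumConfig(splits=(62_000, 2_000, 2_000), max_src_len=60,
                   max_tgt_len=20, k=5, stage_epochs=(15, 5, 10),
                   alpha=1.0, r_t=0.1, base_model_epochs=30)
MSMO = DIUSumConfig(splits=(293_964, 10_355, 10_256), max_src_len=300,
                    max_tgt_len=120, k=8, stage_epochs=(8, 2, 10),
                    alpha=0.5, r_t=0.05, base_model_epochs=20)
```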