DIUSum: Dynamic Image Utilization for Multimodal Summarization
Authors: Min Xiao, Junnan Zhu, Feifei Zhai, Yu Zhou, Chengqing Zong
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results have shown that DIUSum outperforms multiple strong baselines and achieves SOTA on two public multimodal summarization datasets. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 3 Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | Yes | We experiment with the MMS (Li et al. 2018) and MSMO (Zhu et al. 2018) datasets. |
| Dataset Splits | Yes | MMS: train 62,000 / dev 2,000 / test 2,000; MSMO: train 293,964 / dev 10,355 / test 10,256 |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions BERT and VGG-19 as models and Bert Adam as an optimizer, but does not provide specific version numbers for software dependencies (e.g., Python 3.x, PyTorch 1.x) needed for replication. |
| Experiment Setup | Yes | The batch size is set to 8. For the MMS dataset, the max text encoding length is 60 and the max text decoding length is 20. For the MSMO dataset, the max text encoding length is 300 and the max text decoding length is 120. We use the Bert Adam (Kingma and Ba 2014) optimizer and set the learning rate as 1e-4, with the warmup portion as 0.1. When calculating the edit distance of the first k tokens... k is set to 5 and 8 for MMS and MSMO, respectively. For the MMS dataset, the training epochs for the three stages are T1 = 15, T2 = 5, T3 = 10, and the learning weights of the image selector are α = 1, rt = 0.1. For the MSMO dataset, the training epochs for the three stages are T1 = 8, T2 = 2, T3 = 10, and the learning weights of the image selector are α = 0.5, rt = 0.05. The base model trains for 30 and 20 epochs on MMS and MSMO, respectively. In the test phase, we employ beam search and set the beam size as 4. |
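Since the paper releases no code, the reported hyperparameters above can be collected into per-dataset configuration dicts for anyone attempting a reproduction. This is a minimal sketch: the key names (`max_enc_len`, `stage_epochs`, `selector_alpha`, etc.) are illustrative assumptions, not identifiers from any released codebase; only the numeric values come from the paper.

```python
# Hypothetical reproduction configs for DIUSum; key names are our own,
# values are taken from the paper's reported experiment setup.
CONFIGS = {
    "MMS": {
        "batch_size": 8,
        "max_enc_len": 60,            # max text encoding length
        "max_dec_len": 20,            # max text decoding length
        "optimizer": "BertAdam",
        "learning_rate": 1e-4,
        "warmup_portion": 0.1,
        "edit_distance_k": 5,         # first-k tokens for edit distance
        "stage_epochs": (15, 5, 10),  # T1, T2, T3
        "selector_alpha": 1.0,        # image selector weight α
        "selector_rt": 0.1,           # image selector weight rt
        "base_model_epochs": 30,
        "beam_size": 4,
        "splits": {"train": 62_000, "dev": 2_000, "test": 2_000},
    },
    "MSMO": {
        "batch_size": 8,
        "max_enc_len": 300,
        "max_dec_len": 120,
        "optimizer": "BertAdam",
        "learning_rate": 1e-4,
        "warmup_portion": 0.1,
        "edit_distance_k": 8,
        "stage_epochs": (8, 2, 10),
        "selector_alpha": 0.5,
        "selector_rt": 0.05,
        "base_model_epochs": 20,
        "beam_size": 4,
        "splits": {"train": 293_964, "dev": 10_355, "test": 10_256},
    },
}
```

Keeping the two datasets' settings in one structure makes the differences (encoding/decoding lengths, k, stage epochs, selector weights) easy to diff at a glance.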