DIUSum: Dynamic Image Utilization for Multimodal Summarization
Authors: Min Xiao, Junnan Zhu, Feifei Zhai, Yu Zhou, Chengqing Zong
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results have shown that DIUSum outperforms multiple strong baselines and achieves SOTA on two public multimodal summarization datasets. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 3 Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | Yes | We experiment with the MMS (Li et al. 2018) and MSMO (Zhu et al. 2018) datasets. |
| Dataset Splits | Yes | MMS: train 62,000 / dev 2,000 / test 2,000; MSMO: train 293,964 / dev 10,355 / test 10,256 |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions BERT and VGG-19 as models and Bert Adam as an optimizer, but does not provide specific version numbers for software dependencies (e.g., Python 3.x, PyTorch 1.x) needed for replication. |
| Experiment Setup | Yes | The batch size is set to 8. For the MMS dataset, the max text encoding length is 60 and the max text decoding length is 20. For the MSMO dataset, the max text encoding length is 300 and the max text decoding length is 120. We use the Bert Adam (Kingma and Ba 2014) optimizer and set the learning rate as 1e-4, with the warmup portion as 0.1. When calculating the edit distance of the first k tokens... k is set to 5 and 8 for MMS and MSMO, respectively. For the MMS dataset, the training epochs for the three stages are T1 = 15, T2 = 5, T3 = 10, and the learning weights of the image selector are α = 1, rt = 0.1. For the MSMO dataset, the training epochs for the three stages are T1 = 8, T2 = 2, T3 = 10, and the learning weights of the image selector are α = 0.5, rt = 0.05. The base model trains for 30 and 20 epochs on MMS and MSMO, respectively. In the test phase, we employ beam search and set the beam size as 4. |
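Since the paper releases no code, the reported hyperparameters above can be collected into per-dataset configuration dicts for anyone attempting a reproduction. This is a minimal sketch: the key names (`max_enc_len`, `stage_epochs`, `selector_alpha`, etc.) are illustrative assumptions, not identifiers from any released codebase; only the numeric values come from the paper.

```python
# Hypothetical reproduction configs for DIUSum; key names are our own,
# values are taken from the paper's reported experiment setup.
CONFIGS = {
    "MMS": {
        "batch_size": 8,
        "max_enc_len": 60,            # max text encoding length
        "max_dec_len": 20,            # max text decoding length
        "optimizer": "BertAdam",
        "learning_rate": 1e-4,
        "warmup_portion": 0.1,
        "edit_distance_k": 5,         # first-k tokens for edit distance
        "stage_epochs": (15, 5, 10),  # T1, T2, T3
        "selector_alpha": 1.0,        # image selector weight α
        "selector_rt": 0.1,           # image selector weight rt
        "base_model_epochs": 30,
        "beam_size": 4,
        "splits": {"train": 62_000, "dev": 2_000, "test": 2_000},
    },
    "MSMO": {
        "batch_size": 8,
        "max_enc_len": 300,
        "max_dec_len": 120,
        "optimizer": "BertAdam",
        "learning_rate": 1e-4,
        "warmup_portion": 0.1,
        "edit_distance_k": 8,
        "stage_epochs": (8, 2, 10),
        "selector_alpha": 0.5,
        "selector_rt": 0.05,
        "base_model_epochs": 20,
        "beam_size": 4,
        "splits": {"train": 293_964, "dev": 10_355, "test": 10_256},
    },
}
```

Keeping the two datasets' settings in one structure makes the differences (encoding/decoding lengths, k, stage epochs, selector weights) easy to diff at a glance.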