DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding

Authors: Hang Du, Guoshun Nan, Sicheng Zhang, Binzhu Xie, Junrui Xu, Hehe Fan, Qimei Cui, Xiaofeng Tao, Xudong Jiang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU. We conduct extensive experiments on our DocMSU. Results show that the created benchmark enables us to develop and evaluate various deep learning methods for the task of MSU closer to real-world applications.
Researcher Affiliation | Academia | Hang Du¹, Guoshun Nan¹*, Sicheng Zhang¹, Binzhu Xie¹, Junrui Xu¹, Hehe Fan², Qimei Cui¹, Xiaofeng Tao¹, Xudong Jiang³. ¹Beijing University of Posts and Telecommunications, China; ²Zhejiang University; ³Nanyang Technological University, Singapore
Pseudocode | No | The paper describes the model architecture and components but does not provide pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using 'an open-source tool doccano' for annotation but does not provide a link to its own source code or state that it will be released.
Open Datasets | No | The paper describes the creation of the 'DocMSU' dataset and its sources (New York Times, UN News, The Onion, News Thumb) but does not provide a direct link, DOI, or repository for accessing the dataset itself in the provided text.
Dataset Splits | Yes | The dataset is randomly split into 70%, 20%, and 10% for training, validation, and testing, respectively. (A hedged split sketch appears after this table.)
Hardware Specification | Yes | We train our model with a single NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions several software components like BERT, ResNet, Swin-Transformer, and YoloX, but does not provide specific version numbers for them or any other relevant software dependencies for reproducibility.
Experiment Setup | Yes | The learning rate is set to 0.001 and 0.01 for sarcasm detection and localization, respectively. We employ AdamW (Kingma and Ba 2014) as the optimizer. We employ the pre-trained uncased BERT-base (Devlin et al. 2019) as the text encoder. For the baseline, we configure Swin-Transformer with the Tiny setting for sarcasm localization and the Base setting for the detection task. (A hedged configuration sketch appears after this table.)
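
The paper reports only the 70%/20%/10% split ratios; the official split indices and splitting code are not released. The following is a minimal sketch of how such a random document-level split could be reproduced. The function name, the fixed seed, and the assumption that samples arrive as a flat list are illustrative choices, not details from the paper.

```python
import random

def split_docmsu(samples, seed=42):
    """Hypothetical 70/20/10 random split mirroring the reported protocol.

    `samples` is assumed to be a list of annotated documents; the paper
    does not publish the official split indices, so any seed will produce
    a different partition than the authors'.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)

    n_train = int(0.7 * len(samples))
    n_val = int(0.2 * len(samples))

    train = [samples[i] for i in indices[:n_train]]
    val = [samples[i] for i in indices[n_train:n_train + n_val]]
    test = [samples[i] for i in indices[n_train + n_val:]]
    return train, val, test
```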
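Because no source code is released, the exact wiring of the encoders and optimizer is unknown. The sketch below only illustrates the reported hyperparameters (AdamW, learning rates of 0.001 for detection and 0.01 for localization, an uncased BERT-base text encoder, Swin-Tiny for localization and Swin-Base for detection). The Hugging Face and timm model identifiers, the use of those libraries, and the parameter grouping are assumptions made for illustration.

```python
import torch
from transformers import BertModel  # assumed source of the uncased BERT-base encoder
import timm                         # assumed source of the Swin-Transformer backbones

# Pre-trained uncased BERT-base text encoder, as reported in the paper.
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Swin-Base for sarcasm detection, Swin-Tiny for sarcasm localization (reported settings);
# the specific timm checkpoint names are assumptions.
detection_backbone = timm.create_model("swin_base_patch4_window7_224", pretrained=True)
localization_backbone = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True)

# AdamW with the reported learning rates: 0.001 for detection, 0.01 for localization.
# Grouping the text encoder with the detection branch is our assumption.
optimizer = torch.optim.AdamW([
    {"params": list(text_encoder.parameters())
             + list(detection_backbone.parameters()), "lr": 1e-3},  # sarcasm detection
    {"params": list(localization_backbone.parameters()), "lr": 1e-2},  # sarcasm localization
])
```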