A Hierarchical Network for Multimodal Document-Level Relation Extraction

Authors: Lingxing Kong, Jiuliang Wang, Zheng Ma, Qifeng Zhou, Jianbing Zhang, Liang He, Jiajun Chen

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework has state-of-the-art results compared with both unimodal and multimodal baselines; 3) through collaborating with video information, our model better solves the long-dependency and mention-selection problems." and "We split the constructed dataset into a training set with 2300 samples, a development set with 343 samples, and a testing set with 400 samples. The hyper-parameters for all models are tuned on the development set."
Researcher Affiliation | Collaboration | (1) National Key Laboratory for Novel Software Technology, Nanjing University, China; (2) Institute for AI Industry Research (AIR), Tsinghua University; (3) School of Artificial Intelligence, Nanjing University, China; *Internship at AIR, Tsinghua University
Pseudocode | No | The paper describes its methods and provides architectural diagrams (e.g., Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | "We make our resources available (https://github.com/acddca/MDocRE)."
Open Datasets | Yes | "To support this novel task, we construct a human-annotated dataset with VOA news scripts and videos. Our approach to addressing this task is based on a hierarchical network that adeptly captures and fuses multimodal features at two distinct levels." and "We make our resources available (https://github.com/acddca/MDocRE)."
Dataset Splits | Yes | "We split the constructed dataset into a training set with 2300 samples, a development set with 343 samples, and a testing set with 400 samples."
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It only mentions general experimental setup details and software.
Software Dependencies | No | The paper mentions using pre-trained language models like BERT and architectural components such as CNN and Transformers, but it does not specify exact version numbers for software dependencies such as the programming language (e.g., Python 3.x), libraries (e.g., PyTorch 1.x), or specific frameworks.
Experiment Setup | Yes | "We set the number of textual-guided transformer layers LN1 in the Global Encoder and LN2 in the Local Encoder to 1 and 2, respectively. We set the number of heads N to 12. The maximum sequence length for textual and visual inputs is set to 512 and 128, respectively. During training, we use a batch size of 4, a learning rate of 1e-5, and a dropout rate of 0.2." (see the configuration sketch below the table)
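
The Dataset Splits and Experiment Setup rows above quote the reported split sizes and hyper-parameters. The sketch below collects those numbers in one place as a Python configuration object. It is an illustrative reconstruction, not the authors' released code (which is linked at https://github.com/acddca/MDocRE): the class and field names such as ExperimentConfig and global_encoder_layers are assumptions; only the numeric values come from the quoted text.

```python
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    # Dataset split sizes reported in the paper
    num_train: int = 2300
    num_dev: int = 343
    num_test: int = 400

    # Textual-guided transformer depths (LN1 / LN2 in the quoted setup)
    global_encoder_layers: int = 1   # LN1, Global Encoder
    local_encoder_layers: int = 2    # LN2, Local Encoder
    num_attention_heads: int = 12    # N

    # Maximum input lengths
    max_text_len: int = 512          # textual tokens
    max_visual_len: int = 128        # visual features

    # Training hyper-parameters
    batch_size: int = 4
    learning_rate: float = 1e-5
    dropout: float = 0.2


if __name__ == "__main__":
    # Instantiate with the reported defaults; tuning was done on the dev set.
    config = ExperimentConfig()
    print(config)
```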