A Hierarchical Network for Multimodal Document-Level Relation Extraction
Authors: Lingxing Kong, Jiuliang Wang, Zheng Ma, Qifeng Zhou, Jianbing Zhang, Liang He, Jiajun Chen
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework has state-of-the-art results compared with both unimodal and multimodal baselines; 3) through collaborating with video information, our model better solves the long-dependency and mention-selection problems. and We split the constructed dataset into a training set with 2300 samples, a development set with 343 samples, and a testing set with 400 samples. The hyper-parameters for all models are tuned on the development set. |
| Researcher Affiliation | Collaboration | 1 National Key Laboratory for Novel Software Technology, Nanjing University, China 2 Institute for AI Industry Research (AIR), Tsinghua University 3 School of Artificial Intelligence, Nanjing University, China and *Internship at AIR, Tsinghua University |
| Pseudocode | No | The paper describes its methods and provides architectural diagrams (e.g., Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | We make our resources available (https://github.com/acddca/MDocRE). |
| Open Datasets | Yes | To support this novel task, we construct a human-annotated dataset with VOA news scripts and videos. Our approach to addressing this task is based on a hierarchical network that adeptly captures and fuses multimodal features at two distinct levels. and We make our resources available (https://github.com/acddca/MDocRE). |
| Dataset Splits | Yes | We split the constructed dataset into a training set with 2300 samples, a development set with 343 samples, and a testing set with 400 samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It only mentions general experimental setup details and software. |
| Software Dependencies | No | The paper mentions using pre-trained language models like BERT and architectural components such as CNN and Transformers, but it does not specify exact version numbers for software dependencies like programming languages (e.g., Python 3.x), libraries (e.g., PyTorch 1.x), or specific frameworks. |
| Experiment Setup | Yes | We set the number of textual-guided transformer layers LN1 in the Global Encoder and LN2 in the Local Encoder to 1 and 2, respectively. We set the number of heads N to 12. The maximum sequence length for textual and visual inputs is set to 512 and 128, respectively. During training, we use a batch size of 4, a learning rate of 1e-5, and a dropout rate of 0.2. |
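
The dataset split and hyper-parameter values quoted above can be collected into a single configuration object. The sketch below is illustrative only: the class and field names are assumptions, not the authors' code, whose actual implementation lives in the linked repository.

```python
from dataclasses import dataclass

# Minimal configuration sketch summarizing the values reported in the paper.
# Names are hypothetical; see https://github.com/acddca/MDocRE for the
# authors' actual implementation.
@dataclass
class MDocREConfig:
    # Hierarchical encoder depth and attention heads
    global_encoder_layers: int = 1   # text-guided transformer layers in the Global Encoder (LN1)
    local_encoder_layers: int = 2    # text-guided transformer layers in the Local Encoder (LN2)
    num_attention_heads: int = 12    # N

    # Maximum input sequence lengths
    max_text_len: int = 512
    max_visual_len: int = 128

    # Optimization settings
    batch_size: int = 4
    learning_rate: float = 1e-5
    dropout: float = 0.2

    # Dataset split sizes (VOA news scripts and videos)
    train_size: int = 2300
    dev_size: int = 343
    test_size: int = 400


if __name__ == "__main__":
    config = MDocREConfig()
    print(config)
```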