VisualMRC: Machine Reading Comprehension on Document Images

Authors: Ryota Tanaka, Kyosuke Nishida, Sen Yoshida

AAAI 2021, pp. 13878-13888 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with VisualMRC show that this model outperformed the base sequence-to-sequence models and a state-of-the-art VQA model.
Researcher Affiliation | Industry | Ryota Tanaka, Kyosuke Nishida, Sen Yoshida; NTT Media Intelligence Laboratories, NTT Corporation; {ryouta.tanaka.rg, kyosuke.nishida.rx, sen.yoshida.tu}@hco.ntt.co.jp
Pseudocode | No | The paper describes the model architecture and training process in prose and with diagrams, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "We will release further information about this dataset at https://github.com/nttmdlab-nlp/VisualMRC." This link is explicitly for "further information about this dataset", not the source code for the proposed methodology.
Open Datasets | Yes | In this study, we introduce a new visual machine reading comprehension dataset, named VisualMRC, wherein given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language. Compared with existing visual question answering (VQA) datasets that contain texts in images, VisualMRC focuses more on developing natural language understanding and generation abilities. It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from multiple domains of webpages. We also introduce a new model that extends existing sequence-to-sequence models, pre-trained with large-scale text corpora, to take into account the visual layout and content of documents.
Dataset Splits | Yes | We split the dataset into training, development, and test sets, in terms of URL domain; the datasets contain 21,015, 2,839, and 6,708 questions, respectively.
Hardware Specification | Yes | We implemented all the models in PyTorch and experimented on eight NVIDIA Quadro RTX 8000 GPUs.
Software Dependencies | No | The paper names key software components: PyTorch for implementation, BART and T5 from huggingface Transformers as base models, the Tesseract OCR system, and the coco-caption toolkit. However, it does not give specific version numbers for any of them, which are needed for reproducibility.
Experiment Setup | Yes | The balancing parameter λ_sal was set to 1.0. During training, we used a batch size of 32, and trained for a maximum of seven epochs. Our model was trained using the Adam optimizer (Kingma and Ba 2015) with a learning rate of 3e-5. The best model in terms of ROUGE-L was selected using the validation set.
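
The hyperparameters reported in the Experiment Setup row map onto a standard sequence-to-sequence fine-tuning loop. The sketch below illustrates those settings in PyTorch/Transformers, assuming a t5-base backbone and dummy question/answer pairs in place of the actual VisualMRC inputs; the λ_sal-weighted saliency loss and the ROUGE-L model-selection step are only noted in comments. This is an illustration of the reported settings, not the authors' released code.

```python
# Minimal sketch of the reported fine-tuning recipe: Adam with lr 3e-5,
# batch size 32, at most seven epochs. The t5-base backbone and the dummy
# data below are assumptions for illustration only.
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)  # lr as reported

# Placeholder (question, answer) pairs; the real model also consumes OCR tokens,
# region-of-interest features, and layout information from the document image.
pairs = [("question: What does the page show?  context: A sample web page.",
          "It shows a sample web page.")] * 8

def collate(batch):
    src = tokenizer([q for q, _ in batch], padding=True, return_tensors="pt")
    tgt = tokenizer([a for _, a in batch], padding=True, return_tensors="pt")
    labels = tgt.input_ids.masked_fill(tgt.input_ids == tokenizer.pad_token_id, -100)
    return src.input_ids, src.attention_mask, labels

loader = DataLoader(pairs, batch_size=32, shuffle=True, collate_fn=collate)

for epoch in range(7):  # trained for a maximum of seven epochs
    model.train()
    for input_ids, attention_mask, labels in loader:
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
        # The paper adds a saliency loss weighted by lambda_sal = 1.0; omitted here.
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # After each epoch, the best checkpoint would be chosen by ROUGE-L on the
    # development set; that scoring step is omitted from this sketch.

model.save_pretrained("visualmrc_sketch_checkpoint")
```

Because no software versions are reported (see the Software Dependencies row), a faithful reproduction would additionally need to pin the PyTorch, Transformers, Tesseract, and coco-caption versions.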