M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis

Authors: Ning Zhang, Hiuyi Cheng, Jiayu Chen, Zongyuan Jiang, Jun Huang, Yang Xue, Lianwen Jin

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate significant performance improvements in detectors equipped with M2Doc on datasets such as DocLayNet (+11.3 mAP) and M6Doc (+1.9 mAP). Furthermore, through the integration of the DINO detector with M2Doc, we achieve state-of-the-art results on DocLayNet (89.0 mAP), M6Doc (69.9 mAP), and PubLayNet (95.5 mAP).
Researcher Affiliation | Collaboration | (1) South China University of Technology; (2) Platform of AI (PAI), Alibaba Group; (3) SCUT-Zhuhai Institute of Modern Industrial Innovation, Zhuhai, China
Pseudocode | No | The paper describes the proposed method but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code will be publicly released at https://github.com/johnning2333/M2Doc.
Open Datasets | Yes | We evaluate the effectiveness of our method on three layout analysis datasets: PubLayNet (Zhong, Tang, and Jimeno Yepes 2019), DocLayNet (Pfitzmann et al. 2022), and M6Doc (Cheng et al. 2023).
Dataset Splits | No | The paper mentions training on the datasets and refers to a 'test set', but does not specify train/validation/test splits (e.g., percentages, per-split sample counts, or a reference to predefined validation splits).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions 'MMDetection' and 'Hugging Face' but does not specify their version numbers or any other software dependencies with specific versions.
Experiment Setup | Yes | Cascade Mask R-CNN uses the SGD optimizer with an initial learning rate of 2e-2, trained for 36 epochs, with the learning rate decaying to 2e-3 at the 27th epoch and to 2e-4 at the 33rd epoch; DINO uses the AdamW optimizer (Loshchilov and Hutter 2019) with an initial learning rate of 1e-4, trained for 36 epochs, with the learning rate decaying to 3.3e-5 at the 27th epoch and to 1e-5 at the 33rd epoch. For the PubLayNet dataset, both Cascade Mask R-CNN and DINO train for 6 epochs with the same initial learning rates as on DocLayNet, and both learning rates are divided by 10 at the 5th epoch. For DINO, we use DINO-4Scale and set the number of queries to 900, following DINO's default settings. For Cascade Mask R-CNN, we use 10 anchor ratios [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2, 5, 10] to adapt to different scales of input.
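The step schedules quoted above can be sketched in a few lines. The snippet below is only an illustration of the stated per-epoch learning rates (36 epochs, decays at epochs 27 and 33), not the authors' actual MMDetection configuration; the function names are ours.

```python
def stepped_lr(epoch, base_lr, milestones=(27, 33), gamma=0.1):
    """Learning rate at a given (1-indexed) epoch under a step schedule:
    multiply base_lr by gamma at each milestone epoch reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Cascade Mask R-CNN (SGD): 2e-2 -> 2e-3 at epoch 27 -> 2e-4 at epoch 33.
cascade_lr_at_start = stepped_lr(1, 2e-2)    # 2e-2
cascade_lr_late = stepped_lr(33, 2e-2)       # 2e-4

# DINO (AdamW): 1e-4 -> 3.3e-5 at epoch 27 -> 1e-5 at epoch 33. The quoted
# 3.3e-5 implies a factor of roughly 1/3 at the first milestone, so a single
# uniform gamma does not fit; model the schedule explicitly instead.
def dino_lr(epoch):
    if epoch >= 33:
        return 1e-5
    if epoch >= 27:
        return 3.3e-5
    return 1e-4
```

This makes explicit that the two detectors share the same milestone epochs but not the same decay factors.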