M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis
Authors: Ning Zhang, Hiuyi Cheng, Jiayu Chen, Zongyuan Jiang, Jun Huang, Yang Xue, Lianwen Jin
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate significant performance improvements in detectors equipped with M2Doc on datasets such as DocLayNet (+11.3 mAP) and M6Doc (+1.9 mAP). Furthermore, through the integration of the DINO detector with M2Doc, we achieve state-of-the-art results on DocLayNet (89.0 mAP), M6Doc (69.9 mAP), and PubLayNet (95.5 mAP). |
| Researcher Affiliation | Collaboration | 1 South China University of Technology; 2 Platform of AI (PAI), Alibaba Group; 3 SCUT-Zhuhai Institute of Modern Industrial Innovation, Zhuhai, China |
| Pseudocode | No | The paper describes the proposed method but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be publicly released at https://github.com/johnning2333/M2Doc. |
| Open Datasets | Yes | We evaluate the effectiveness of our method on three layout analysis datasets: PubLayNet (Zhong, Tang, and Jimeno Yepes 2019), DocLayNet (Pfitzmann et al. 2022), and M6Doc (Cheng et al. 2023). |
| Dataset Splits | No | The paper mentions training on datasets and the use of a 'test set', but does not explicitly provide details about training/validation/test dataset splits (e.g., percentages or specific sample counts for each split, or reference to predefined validation splits). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions 'MMDetection' and 'Hugging Face' but does not specify their version numbers or any other software dependencies with specific versions. |
| Experiment Setup | Yes | Cascade Mask R-CNN uses the SGD optimizer with an initial learning rate of 2e-2, trained for 36 epochs; the learning rate decays to 2e-3 at epoch 27 and to 2e-4 at epoch 33. DINO uses the AdamW optimizer (Loshchilov and Hutter 2019) with an initial learning rate of 1e-4, trained for 36 epochs; the learning rate decays to 3.3e-5 at epoch 27 and to 1e-5 at epoch 33. On the PubLayNet dataset, both Cascade Mask R-CNN and DINO are trained for 6 epochs with the same initial learning rates as on DocLayNet, and both learning rates are divided by 10 at epoch 5. For DINO, we use DINO-4Scale and set the number of queries to 900, following DINO's default settings. For Cascade Mask R-CNN, we use 10 anchor aspect ratios [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2, 5, 10] to adapt to different scales of input. |
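The step-decay schedules reported in the Experiment Setup row can be sketched as a small helper. This is a minimal illustration of the reported milestones, not the authors' actual training code; the function name and the 1-indexed epoch convention are assumptions.

```python
def step_lr(epoch, base_lr, milestones):
    """Piecewise-constant LR schedule: return the learning rate in effect
    at a given (1-indexed) epoch. `milestones` maps a decay epoch to the
    LR used from that epoch onward."""
    lr = base_lr
    for decay_epoch in sorted(milestones):
        if epoch >= decay_epoch:
            lr = milestones[decay_epoch]
    return lr

# Schedules as reported in the paper (36 epochs total on DocLayNet/M6Doc).
def cascade_lr(epoch):          # Cascade Mask R-CNN, SGD
    return step_lr(epoch, 2e-2, {27: 2e-3, 33: 2e-4})

def dino_lr(epoch):             # DINO, AdamW
    return step_lr(epoch, 1e-4, {27: 3.3e-5, 33: 1e-5})
```

For example, `cascade_lr(1)` yields 2e-2, `cascade_lr(27)` yields 2e-3, and `cascade_lr(36)` yields 2e-4, matching the reported decay points.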