M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis
Authors: Ning Zhang, Hiuyi Cheng, Jiayu Chen, Zongyuan Jiang, Jun Huang, Yang Xue, Lianwen Jin
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate significant performance improvements in detectors equipped with M2Doc on datasets such as DocLayNet (+11.3 mAP) and M6Doc (+1.9 mAP). Furthermore, through the integration of the DINO detector with M2Doc, we achieve state-of-the-art results on DocLayNet (89.0 mAP), M6Doc (69.9 mAP), and PubLayNet (95.5 mAP). |
| Researcher Affiliation | Collaboration | 1 South China University of Technology; 2 Platform of AI (PAI), Alibaba Group; 3 SCUT-Zhuhai Institute of Modern Industrial Innovation, Zhuhai, China |
| Pseudocode | No | The paper describes the proposed method but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be publicly released at https://github.com/johnning2333/M2Doc. |
| Open Datasets | Yes | We evaluate the effectiveness of our method on three layout analysis datasets: PubLayNet (Zhong, Tang, and Jimeno Yepes 2019), DocLayNet (Pfitzmann et al. 2022), and M6Doc (Cheng et al. 2023). |
| Dataset Splits | No | The paper mentions training on datasets and the use of a 'test set', but does not explicitly provide details about training/validation/test dataset splits (e.g., percentages or specific sample counts for each split, or reference to predefined validation splits). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions 'MMDetection' and 'Hugging Face' but does not specify their version numbers or any other software dependencies with specific versions. |
| Experiment Setup | Yes | Cascade Mask R-CNN uses the SGD optimizer with an initial learning rate of 2e-2, trained for 36 epochs; the learning rate decays to 2e-3 at epoch 27 and to 2e-4 at epoch 33. DINO uses the AdamW optimizer (Loshchilov and Hutter 2019) with an initial learning rate of 1e-4, trained for 36 epochs; the learning rate decays to 3.3e-5 at epoch 27 and to 1e-5 at epoch 33. On the PubLayNet dataset, both Cascade Mask R-CNN and DINO are trained for 6 epochs with the same initial learning rates as on DocLayNet, and both learning rates are divided by 10 at epoch 5. For DINO, we use DINO-4Scale and set the number of queries to 900, following DINO's default settings. For Cascade Mask R-CNN, we use 10 anchor aspect ratios [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2, 5, 10] to adapt to different scales of input. |
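The step-decay schedules reported in the Experiment Setup row can be sketched as a small helper. This is a minimal illustration of the reported milestones, not the authors' actual training code; the function name and the 1-indexed epoch convention are assumptions.

```python
def step_lr(epoch, base_lr, milestones):
    """Piecewise-constant LR schedule: return the learning rate in effect
    at a given (1-indexed) epoch. `milestones` maps a decay epoch to the
    LR used from that epoch onward."""
    lr = base_lr
    for decay_epoch in sorted(milestones):
        if epoch >= decay_epoch:
            lr = milestones[decay_epoch]
    return lr

# Schedules as reported in the paper (36 epochs total on DocLayNet/M6Doc).
def cascade_lr(epoch):          # Cascade Mask R-CNN, SGD
    return step_lr(epoch, 2e-2, {27: 2e-3, 33: 2e-4})

def dino_lr(epoch):             # DINO, AdamW
    return step_lr(epoch, 1e-4, {27: 3.3e-5, 33: 1e-5})
```

For example, `cascade_lr(1)` yields 2e-2, `cascade_lr(27)` yields 2e-3, and `cascade_lr(36)` yields 2e-4, matching the reported decay points.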