Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Chain-of-region: Visual Language Models Need Details for Diagram Analysis

Authors: Xue Li, Yiyou Sun, Wei Cheng, Yinglun Zhu, Haifeng Chen

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted extensive experiments on the Massive Multi-discipline Multimodal (MMMU, Yue et al. (2024a)) dataset, which includes a diverse array of multimodal questions sourced from college exams, quizzes, and textbooks. We validate our approach through a series of experiments that demonstrate enhanced performance in diagram analysis tasks, setting a new standard for integrating visual and language processing in a multimodal context.
Researcher Affiliation | Collaboration | (1) University of Wisconsin-Madison; (2) University of California, Berkeley; (3) NEC Laboratories America; (4) University of California, Riverside
Pseudocode | Yes | _, X_bi = cv2.threshold(X_im, 0, 1, cv2.THRESH_OTSU); _, X_fg = cv2.connectedComponents(X_bi); _, X_bg = cv2.connectedComponents(1 - X_bi); X_region = X_fg + X_bg + (X_bg > 0) * offset
Open Source Code | No | The paper discusses the use of the OpenCV library and refers to third-party tools like PaddlePaddle. It mentions "our own implementations for detecting rectangular shapes" but does not provide any statement or link indicating the release of the code for their described methodology.
Open Datasets | Yes | We conducted extensive experiments on the Massive Multi-discipline Multimodal (MMMU, Yue et al. (2024a)) dataset, which includes a diverse array of multimodal questions sourced from college exams, quizzes, and textbooks.
Dataset Splits | No | The paper mentions using a tailored subset of the MMMU dataset comprising 5,210 images and constructing a custom dataset of 100+ samples for segmentation evaluation. However, it does not provide specific training, validation, or test splits for either dataset.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or other computer specifications used for running the experiments. It only mentions that the CoR framework itself "requires only CPU processing".
Software Dependencies | Yes | Specifically, we utilize gpt-4-turbo, chatgpt-4o-latest, and gpt-4o-mini-2024-07-18 respectively.
Experiment Setup | Yes | The primary hyperparameters employed in our Chain-of-Region method include the pre-defined recognition call limits in split step and the cluster number during the unstructured merge step, detailed in Sections 3.1.2 and 3.1.3. In the current implementation, we have set the recognition call limit to 10 and the cluster number to 5.
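
The snippet quoted in the Pseudocode row can be unpacked into a small runnable example. The sketch below is not the authors' code: it mirrors the four quoted steps in dependency-free Python, with a hand-written Otsu threshold and an 8-connected flood fill standing in for cv2.threshold and cv2.connectedComponents. The helper names (otsu_threshold, connected_components, label_regions) are hypothetical, and offset is assumed to equal the number of foreground components, since the quoted snippet does not define it.

```python
from collections import deque


def otsu_threshold(image):
    """Return the Otsu threshold t for a 2D list of ints in 0..255.

    Pixels with value > t are treated as foreground.
    """
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance that Otsu maximizes.
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def connected_components(binary):
    """Label 8-connected groups of nonzero pixels 1..n; zeros stay 0."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if binary[r][c] == 0 or labels[r][c] != 0:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return count, labels


def label_regions(image):
    """Assign one id to every connected foreground and background region."""
    t = otsu_threshold(image)
    x_bi = [[1 if v > t else 0 for v in row] for row in image]
    n_fg, x_fg = connected_components(x_bi)
    n_bg, x_bg = connected_components([[1 - v for v in row] for row in x_bi])
    offset = n_fg  # assumption: shift background ids past the foreground ids
    x_region = [
        [fg + bg + (offset if bg > 0 else 0) for fg, bg in zip(row_fg, row_bg)]
        for row_fg, row_bg in zip(x_fg, x_bg)
    ]
    return x_region, n_fg + n_bg


# Toy diagram: a dark box outline on a white page, enclosing one white cell.
image = [
    [255, 255, 255, 255, 255],
    [255,   0,   0,   0, 255],
    [255,   0, 255,   0, 255],
    [255,   0,   0,   0, 255],
    [255, 255, 255, 255, 255],
]
regions, region_count = label_regions(image)
for row in regions:
    print(row)
```

On this toy image the white surround, the enclosed white cell, and the dark outline each receive a distinct id, which is the property the offset term appears intended to guarantee.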