Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Chain-of-region: Visual Language Models Need Details for Diagram Analysis
Authors: Xue Li, Yiyou Sun, Wei Cheng, Yinglun Zhu, Haifeng Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on the Massive Multi-discipline Multimodal (MMMU, Yue et al. (2024a)) dataset, which includes a diverse array of multimodal questions sourced from college exams, quizzes, and textbooks. We validate our approach through a series of experiments that demonstrate enhanced performance in diagram analysis tasks, setting a new standard for integrating visual and language processing in a multimodal context. |
| Researcher Affiliation | Collaboration | University of Wisconsin-Madison; University of California, Berkeley; NEC Laboratories America; University of California, Riverside |
| Pseudocode | Yes | `_, X_bi = cv2.threshold(X_im, 0, 1, cv2.THRESH_OTSU)`; `_, X_fg = cv2.connectedComponents(X_bi)`; `_, X_bg = cv2.connectedComponents(1 - X_bi)`; `X_region = X_fg + X_bg + (X_bg > 0) * offset` |
| Open Source Code | No | The paper discusses the use of the OpenCV library and refers to third-party tools like PaddlePaddle. It mentions "our own implementations for detecting rectangular shapes" but does not provide any statement or link indicating the release of the code for their described methodology. |
| Open Datasets | Yes | We conducted extensive experiments on the Massive Multi-discipline Multimodal (MMMU, Yue et al. (2024a)) dataset, which includes a diverse array of multimodal questions sourced from college exams, quizzes, and textbooks. |
| Dataset Splits | No | The paper mentions using a tailored subset of the MMMU dataset comprising 5,210 images and constructing a custom dataset of 100+ samples for segmentation evaluation. However, it does not provide specific training, validation, or test splits for either dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or other computer specifications used for running the experiments. It only mentions that the CoR framework itself "requires only CPU processing". |
| Software Dependencies | Yes | Specifically, we utilize gpt-4-turbo, chatgpt-4o-latest, and gpt-4o-mini-2024-07-18 respectively. |
| Experiment Setup | Yes | The primary hyperparameters employed in our Chain-of-Region method include the pre-defined recognition call limits in split step and the cluster number during the unstructured merge step, detailed in Sections 3.1.2 and 3.1.3. In the current implementation, we have set the recognition call limit to 10 and the cluster number to 5. |