Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Primitive Vision: Improving Diagram Understanding in MLLMs

Authors: Shan Zhang, Aotian Chen, Yanpeng Sun, Jindong Gu, Yi-Yu Zheng, Piotr Koniusz, Kai Zou, Anton Van Den Hengel, Yuan Xue

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our systematic evaluation of the visual grounding capabilities of state-of-the-art MLLMs highlights that fine-grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70% grounding error rate, and correcting these errors improves reasoning accuracy by 12%). We thus propose a novel approach featuring a geometrically-grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model's reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12% on MathVerse and is on par with GPT-4V on MathVista.
Researcher Affiliation Collaboration 1 Australian Institute for Machine Learning, 2 Data61 CSIRO, 3 The Ohio State University, 4 National University of Singapore, 5 University of Oxford, 6 NetMind.ai, 7 Australian National University, 8 The Commonwealth Bank of Australia.
Pseudocode No The paper describes its methodology using text and diagrams (Fig. 2, Fig. 8, Fig. 9) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code is available at github.com/AI4Math-ShanZhang/SVE-Math.
Open Datasets Yes We systematically analyzed MLLMs' ability to describe geometric entities using a meticulously collected set of 100 images from the Geo170K dataset (Gao et al., 2023a). We evaluate PRIMITIVE on several public mathematical benchmarks... MathVerse (Zhang et al., 2024a), GeoQA (Gao et al., 2023a), and MathVista (Lu et al., 2023). We incorporated 20,672 images from the FigureQA training dataset with bounding box annotations for the shape grounding task. For fair comparison, we train our model on MathV360K (Shi et al., 2024) using a batch size of 16 for one epoch, evaluating on MathVista (Lu et al., 2023) and the testmini set of MathVerse (Zhang et al., 2024a).
Dataset Splits Yes We train PRIMITIVE for one epoch for cross-modal alignment and two epochs for instruction tuning on Geo170K (Gao et al., 2023a), evaluating on GeoQA. For fair comparison, we train on MathV360K (Shi et al., 2024) with a batch size of 16 for one epoch, evaluating on MathVista (Lu et al., 2023) and MathVerse (Zhang et al., 2024a). When tested on the Geo170K test set of the GeoQA benchmark, the top-1 accuracy dropped from 67.0% to 63.2%.
Hardware Specification Yes Training is conducted on 8 A100 GPUs with a batch size of 32. Training is conducted on 8 A100 GPUs with a batch size of 128 using the MathV360K dataset, which includes 40K images and 360K question-answer pairs.
Software Dependencies No The paper mentions several software components like "Matplotlib Python library", "LLaVA-1.5 architecture", "DeepSeekMath-7B-Instruct", "Qwen2.5-Math-7B-Instruct", "LLaMA-2", "CLIP ViT-L", and "GLIP-T model (with Swin-Tiny as the backbone)", but does not provide specific version numbers for any of them.
Experiment Setup Yes Training is conducted on 8 A100 GPUs with a batch size of 32. The base learning rate is set to 1e-5 for the language backbone and 1e-4 for all other parameters, and it is decreased by a factor of 0.1 at 67% and 89% of the total training steps. We train PRIMITIVE for one epoch for cross-modal alignment and two epochs for instruction tuning on the Geo170K (Gao et al., 2023a) dataset, evaluating the model on GeoQA (Gao et al., 2023a). For fair comparison, we train our model on MathV360K (Shi et al., 2024) using a batch size of 16 for one epoch with an initial learning rate of 3e-5, evaluating on MathVista (Lu et al., 2023) and the testmini set of MathVerse (Zhang et al., 2024a).
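
The stepped learning-rate decay quoted above (a factor-of-0.1 drop at 67% and 89% of total training steps) can be sketched as a plain function. This is a minimal illustration of the schedule as described, not the authors' code; the function name `lr_at_step` and its defaults are assumptions for the example:

```python
def lr_at_step(step, total_steps, base_lr, milestones=(0.67, 0.89), gamma=0.1):
    """Piecewise-constant LR: multiply base_lr by gamma at each milestone fraction.

    step        -- current training step (0-indexed)
    total_steps -- total number of training steps
    base_lr     -- initial learning rate (e.g. 1e-5 for the language backbone,
                   1e-4 for all other parameters, per the quoted setup)
    milestones  -- fractions of total_steps at which the LR is decayed
    gamma       -- multiplicative decay factor at each milestone
    """
    lr = base_lr
    for frac in milestones:
        if step >= frac * total_steps:
            lr *= gamma
    return lr

# Example: over 1000 steps, the backbone LR of 1e-5 drops to 1e-6 at
# step 670 and to 1e-7 at step 890.
print(lr_at_step(0, 1000, 1e-5))    # before any milestone
print(lr_at_step(700, 1000, 1e-5))  # after the 67% milestone
print(lr_at_step(950, 1000, 1e-5))  # after the 89% milestone
```

In a typical PyTorch setup, the same effect would be achieved with two optimizer parameter groups (backbone vs. the rest) and a `MultiStepLR`-style scheduler; the function above just makes the schedule explicit.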