Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Authors: Chun Wang, Xiaojun Ye, Xiaoran Pan, Zihao Pan, Haofan Wang, Yiren Song

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference.
Researcher Affiliation	Collaboration	Chun Wang1,2 Xiaojun Ye1 Xiaoran Pan1 Zihao Pan3 Haofan Wang4 Yiren Song2,5 1Zhejiang University 2Creatly.ai 3Sun Yat-sen University 4LibLib.ai 5NUS
Pseudocode	No	The paper describes methods and pipelines using diagrams (e.g., Figure 3, Figure 11, Figure 12) and mathematical equations (e.g., in Section 3.3), but does not contain any explicitly structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code and data will be released at https://github.com/Thorin215/GRE.
Open Datasets	Yes	First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. ... Code and data will be released at https://github.com/Thorin215/GRE. ... We make full use of the publicly available dataset MP16-Pro [21] with GPS coordinates. ... We test our trained model on Im2GPS3k [13] and Google World Streets 15k (GWS15k) [8]. ... We also have compared our model on the OSV-5M [2] in Table 8, where our model demonstrates excellent performance.
Dataset Splits	Yes	We randomly sample 5% of MP-16 [24], a dataset containing 4.72 million geotagged images from Flickr, as geography seed datasets to construct our GRE30K. This dataset is strategically utilized across our three-stage training process: GRE30K-CoT, comprising 20k high-quality Chain-of-Thought examples curated by geography experts and standardized in format, serves for cold-start initialization; GRE30K-Judge, consisting of 10k CoT judgment tasks, is employed for Stage I reinforcement learning training and the remaining 170k seed datasets are utilized for Stage II reinforcement learning training.
Hardware Specification	Yes	All experiments are conducted with PyTorch and 8 NVIDIA H20 (96G) GPUs.
Software Dependencies	No	The paper mentions "PyTorch" but does not specify a version number or other key software components with their versions.
Experiment Setup	Yes	We adopt Qwen2.5-VL-7B as base model, the SFT experiments are conducted with a batch size of 128, a learning rate of 1e-5, and training over 1 epochs.
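
For readers who want to track the reported settings in a machine-readable form, the values quoted in the table above can be collected into a small record. This is only a sketch: the class name `ReportedSFTSetup`, the dictionary `REPORTED_SPLITS`, and all field names are hypothetical and do not come from the paper or its repository. Only the values themselves are taken from the quoted text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSFTSetup:
    """Hypothetical container for the setup values quoted in the table."""
    base_model: str = "Qwen2.5-VL-7B"
    batch_size: int = 128
    learning_rate: float = 1e-5
    epochs: int = 1
    framework: str = "PyTorch"          # no version is reported
    hardware: str = "8x NVIDIA H20 (96G)"


# Split sizes as quoted in the Dataset Splits row (labels are illustrative).
REPORTED_SPLITS = {
    "GRE30K-CoT (cold start)": 20_000,
    "GRE30K-Judge (Stage I RL)": 10_000,
    "remaining seed data (Stage II RL)": 170_000,
}

setup = ReportedSFTSetup()
total_seed = sum(REPORTED_SPLITS.values())

print(asdict(setup))
print(f"total quoted seed examples: {total_seed}")
```

Anything the table marks as unreported, such as library versions, is deliberately left out of the record rather than guessed.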