Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models

Authors: Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, Jiaheng Wei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct both qualitative and quantitative experiments, including ablation studies, to evaluate the effectiveness of our curated dataset MP16-Reason and the GRPO-based training strategy employed in GLOBE. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories.
Researcher Affiliation	Academia	(1) The Hong Kong University of Science and Technology (Guangzhou); (2) The Hong Kong University of Science and Technology; (3) Independent Researcher
Pseudocode	No	The paper describes the methodology in detail, including dataset curation, reward construction, and model fine-tuning. However, it does not present any explicitly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code	Yes	The data and code are available at https://github.com/lingli1996/GLOBE.
Open Datasets	Yes	To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE... The data and code are available at https://github.com/lingli1996/GLOBE. We further evaluate all models on the public geo-localization benchmark IM2GPS3K [82] and OSV-5M [83].
Dataset Splits	Yes	The curated dataset MP16-Reason is divided into two subsets: MP16-Reason-Train with 33k samples and MP16-Reason-Test with 12k samples, respectively. MP16-Reason-Train is used to train GLOBE, while MP16-Reason-Test is used to evaluate all baseline methods.
Hardware Specification	Yes	For data curation, we deployed Qwen2.5-VL-72B and InternVL3-78B using 8 H20 GPUs under the VLLM framework, while GeoCLIP was run separately on a single H20 GPU. ... In GRPO training, the 7B model was trained on 8 H20 GPUs with a batch size of 16, yielding a throughput of approximately 0.44 examples per second.
Software Dependencies	No	The paper mentions the use of the 'VLLM framework', 'PyTorch', and the 'AdamW' optimizer but does not specify exact version numbers for these software components. For example, it lists 'VLLM framework' and 'PyTorch' in the context of hardware setup but without version details.
Experiment Setup	Yes	We summarize the key hyper-parameters used in training GLOBE in Table 6. These settings are selected based on standard practices in fine-tuning large vision-language models and further adjusted through preliminary ablation studies on a held-out validation set. Table 6 includes: Learning Rate 1e-6, Total Batch Size 16, Weight Decay 0.1, Warmup Ratio 0.01, Optimizer AdamW, Adam Beta1 0.9, Adam Beta2 0.95, LR Scheduler cosine, Model Max Length 8192.
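
For readers who want to reuse the reported settings, the Table 6 values quoted in the Experiment Setup row can be collected into a plain configuration dictionary, as in the minimal sketch below. The dictionary keys, the `lr_at_step` helper, and the step count in the usage note are hypothetical names chosen for illustration; they are not taken from the GLOBE repository. The schedule shape (linear warmup followed by cosine decay to zero) is an assumption, since the table only says "cosine" with a warmup ratio of 0.01.

```python
import math

# Values as quoted in the Experiment Setup row (Table 6 of the paper).
# The key names are hypothetical; only the values come from the source.
GLOBE_HPARAMS = {
    "learning_rate": 1e-6,
    "total_batch_size": 16,
    "weight_decay": 0.1,
    "warmup_ratio": 0.01,
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "lr_scheduler": "cosine",
    "model_max_length": 8192,
}


def lr_at_step(step, total_steps, hparams=GLOBE_HPARAMS):
    """Learning rate at a given optimizer step.

    Assumed schedule (not confirmed by the source): linear warmup from 0
    to the peak rate, then cosine decay from the peak down to 0.
    """
    peak = hparams["learning_rate"]
    warmup_steps = max(1, round(total_steps * hparams["warmup_ratio"]))
    if step < warmup_steps:
        # Linear ramp during warmup.
        return peak * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical run of 2000 optimizer steps, `lr_at_step(20, 2000)` is the first step at the peak rate, and the rate then decays smoothly until step 2000. The actual number of training steps is not stated in the rows above and would need to be taken from the paper or the released code.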