Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CVGL: Causal Learning and Geometric Topology

Authors: Songsong Ouyang, Yingying Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on CVUSA, CVACT, and their robustness-enhanced variants (CVUSA-C-ALL and CVACT-C-ALL) demonstrate that CLGT achieves state-of-the-art performance, particularly under challenging real-world corruptions.
Researcher Affiliation	Academia	Songsong Ouyang Yingying Zhu College of Computer Science and Software Engineering Shenzhen University EMAIL, EMAIL
Pseudocode	No	The paper describes methods through textual descriptions and mathematical formulas (Equations 1-9) and provides architectural diagrams (Figures 3, 4, 5), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code	Yes	Our codes are available at CLGT.
Open Datasets	Yes	We evaluate our model on three widely-used cross-view geo-localization benchmarks (CVUSA [22], CVACT [8], and VIGOR [36]) as well as their robust variants: CVACT_val-C-ALL, CVACT_test-C-ALL, and CVUSA-C-ALL [32], which introduce various real-world corruptions to test model robustness under challenging conditions.
Dataset Splits	Yes	CVUSA and CVACT each provide 35,532 training and 8,884 testing image pairs with a strict 1-to-1 ground-to-aerial correspondence. In addition, CVACT offers an extra 92,802 GPS-tagged query images for large-scale retrieval evaluation, making it suitable for both standard and large-scale testing scenarios. VIGOR is a more challenging benchmark that spans four metropolitan areas (New York, Seattle, San Francisco, and Chicago) and includes 105,214 query and 90,618 reference images.
Hardware Specification	Yes	The training is conducted on eight 32GB NVIDIA V100 GPUs.
Software Dependencies	No	The paper mentions using AdamW for optimization but does not specify version numbers for any software libraries, frameworks, or programming languages used for implementation. It states 'Other training settings follow those used in Sample4Geo' but these details are not provided within the paper.
Experiment Setup	Yes	The model is optimized using AdamW with an initial learning rate of 0.5 × 10⁻³. We train the network for 40 epochs with a batch size of 128. The training is conducted on eight 32GB NVIDIA V100 GPUs. For both α and γ in Equation 9, we set their values to 0.1 to provide auxiliary supervision without overwhelming the main optimization objective. When we increase the value of γ, the model performance improves across various datasets. However, to prevent the value from becoming too large and causing model collapse, which would negatively affect the matching between street and aerial images, we set a default value of 0.1, although this collapse was not observed during training. The optimal value is 0.5, and we will also provide hyperparameter experiments and model performance with γ = 0.5 in the supplementary materials. We set the initial three radii for the content-aware mask to 0.1, 0.3, and 0.6, respectively. We also observe that performance is stable under small variations in the initial radius.
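
For readers who want to reuse the reported settings, the values in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch and is not taken from the authors' code: the class name `ReportedSetup`, its field names, and the helper `steps_per_epoch` are hypothetical. The learning rate is read as 0.5 × 10⁻³, which is an interpretation of the extracted text, and the step count assumes that the batch size of 128 is the global batch size and that the final partial batch is kept.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters as reported in the Experiment Setup row (hypothetical container)."""

    optimizer: str = "AdamW"
    learning_rate: float = 0.5e-3  # assumption: the extracted text means 0.5 x 10^-3
    epochs: int = 40
    batch_size: int = 128          # assumption: global batch size across all GPUs
    num_gpus: int = 8              # eight 32GB NVIDIA V100 GPUs
    alpha: float = 0.1             # weight alpha in the paper's Equation 9
    gamma: float = 0.1             # weight gamma in Equation 9 (paper reports 0.5 as optimal)
    mask_radii: Tuple[float, float, float] = (0.1, 0.3, 0.6)  # initial content-aware mask radii


def steps_per_epoch(num_train_pairs: int, batch_size: int) -> int:
    """Ceiling division: number of optimizer steps if the last partial batch is kept."""
    return -(-num_train_pairs // batch_size)


if __name__ == "__main__":
    cfg = ReportedSetup()
    # CVUSA and CVACT each provide 35,532 training pairs (Dataset Splits row).
    print(cfg)
    print("steps per epoch:", steps_per_epoch(35_532, cfg.batch_size))
```

Any setting not listed in the table (for example, the learning-rate schedule or weight decay) is described by the paper only as following Sample4Geo, so it is deliberately left out of this sketch.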