Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Controllable Satellite-to-Street-View Synthesis with Precise Pose Alignment and Zero-Shot Environmental Control

Authors: Xianghui Ze, Zhenbo Song, Qiwei Wang, Jianfeng Lu, Yujiao Shi

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive quantitative and qualitative evaluations demonstrate that our approach significantly improves pose accuracy and enhances the diversity and realism of generated street-view images, setting a new benchmark for satellite-to-street-view generation tasks."
Researcher Affiliation | Academia | "1Nanjing University of Science and Technology, 2ShanghaiTech University"
Pseudocode | Yes | Algorithm 1: Iterative Homography Adjustment for Pose Refinement.
Input: diffusion steps T, the noisy image z_T, satellite image condition I_s, and the rotation R and translation T relative to the satellite image.
H ← I  // initialize the homography as a diagonal matrix of ones (identity)
for t = T down to 1 do
    z_t ← z_t ∘ grid(H)                         // warp the latent by the current homography
    z_{t,0}, z_{t-1} ← DDIM(z_t, t, R, T, I_s)  // eliminate noise to obtain z_{t,0}, z_{t-1}
    H ← H − ∇_H L_pose(z_{t,0}, R, T, I_s)      // adjust the homography matrix
    z_{t-1} ← z_{t-1} ∘ grid(H)                 // re-warp with the updated homography
end for
Open Source Code | No | "The paper does not provide any explicit statement about releasing its source code, nor does it include a link to a code repository."
Open Datasets | Yes | "Datasets. We adopt three cross-view datasets: CVUSA (Zhai et al. (2017)), KITTI (Geiger et al. (2013); Shi & Li (2022)), and VIGOR (Zhu et al. (2021); Lentsch et al. (2022)). These datasets comprise pairs of cross-view data, combining ground-level images with their corresponding satellite images."
Dataset Splits | Yes | "CVUSA comprises 35,532 pairs of satellite and street-view images for training and 8,884 pairs for testing. Following the setup of the cross-view localization task (Shi & Li (2022); Xia et al. (2023)), KITTI includes 19,655 pairs in the training data and 3,773 pairs in the testing data. VIGOR gathers data from New York, Seattle, San Francisco, and Chicago, dividing the data from each city into 52,609 pairs for the training set and 52,605 pairs for the test set."
Hardware Specification | No | The paper mentions "Training process on three GPUs" and a computational-cost table (Wo. IHA: 20126 MB, 5.406 s; W. IHA: 21022 MB, 5.513 s), but does not provide specific GPU models, CPU models, or other detailed hardware specifications used for running the experiments.
Software Dependencies | Yes | "Our model is finetuned based on the Stable Diffusion 1.5 model (Rombach et al. (2022a)), with the Cross-Attention of diffusion replaced by Geometric Cross-Attention, and satellite image conditions processed through a simple ViT network for feature extraction."
Experiment Setup | Yes | "Experimental setup. We take 256×256 satellite images as input to predict 128×512 ground images, following the same setup as in Shi et al. (2022a) for fair comparison. Our model is finetuned based on the Stable Diffusion 1.5 model (Rombach et al. (2022a)), with the Cross-Attention of diffusion replaced by Geometric Cross-Attention, and satellite image conditions processed through a simple ViT network for feature extraction. During inference, we employ DDIM sampling with 50 sampling steps, applying the Homography Adjustment scheme in the first 40 sampling steps and Zero-Shot Environmental Control throughout the entire sampling process. In Geometric Cross-Attention, we utilize 8 sampling heights of [-3, -2, -1, 1, 2, 3, 4, 5]. This constitutes an empirical setup. Training is run on three GPUs with a batch size of 32 for 200 epochs."
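The control flow of the Algorithm 1 quote above can be sketched as follows. This is a toy illustration, not the authors' implementation: `warp`, `ddim_step`, and `pose_loss_grad` are hypothetical stand-ins (the real method warps a latent image via grid sampling and backpropagates a pose-alignment loss), and the step size `lr` is an assumed parameter.

```python
import numpy as np

def warp(z, H):
    """Stub for the grid-sample warp of the latent by homography H.
    This toy uses a scalar latent, so the warp is a no-op kept only
    to mirror the two warp lines of Algorithm 1."""
    return z

def ddim_step(z_t, t, R, T, I_s):
    """Hypothetical stand-in for one DDIM step: returns the clean
    estimate z_{t,0} and the next latent z_{t-1}."""
    z0_hat = 0.9 * z_t + 0.1 * I_s    # toy denoiser pulls z toward the condition
    z_prev = 0.95 * z_t + 0.05 * I_s  # toy DDIM update
    return z0_hat, z_prev

def pose_loss_grad(H, z0_hat, R, T, I_s):
    """Hypothetical gradient of the pose loss w.r.t. H: as a toy
    surrogate, the loss minimum is a homography with a small
    horizontal shift, so the gradient is H minus that target."""
    H_target = np.eye(3)
    H_target[0, 2] = 0.2
    return H - H_target

def iterative_homography_adjustment(z_T, R, T, I_s, steps=50, lr=0.5):
    H = np.eye(3)  # "diagonal matrix of ones": identity initialization
    z_t = z_T
    for t in range(steps, 0, -1):
        z_t = warp(z_t, H)                                  # z_t <- z_t o grid(H)
        z0_hat, z_t = ddim_step(z_t, t, R, T, I_s)          # denoise
        H = H - lr * pose_loss_grad(H, z0_hat, R, T, I_s)   # adjust homography
        z_t = warp(z_t, H)                                  # re-warp with updated H
    return z_t, H
```

Run on the toy surrogate, the loop drives H from the identity toward the loss minimum while denoising proceeds, which is the interleaving the pseudocode describes.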
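The inference schedule stated in the setup row (50 DDIM steps, Homography Adjustment in the first 40 steps only, Zero-Shot Environmental Control throughout, 8 sampling heights) can be captured in a small helper. `step_controls` is a hypothetical function for illustration, not code from the paper; only the constants come from the quoted setup.

```python
SAMPLING_STEPS = 50                     # DDIM sampling steps reported in the paper
IHA_STEPS = 40                          # Homography Adjustment: first 40 steps only
HEIGHTS = [-3, -2, -1, 1, 2, 3, 4, 5]   # 8 sampling heights in Geometric Cross-Attention

def step_controls(i):
    """Return which controls are active at 0-indexed sampling step i
    (i = 0 is the first, noisiest DDIM step)."""
    if not 0 <= i < SAMPLING_STEPS:
        raise ValueError("step index out of range")
    return {
        "homography_adjustment": i < IHA_STEPS,
        "environmental_control": True,  # applied throughout sampling
    }
```

For example, `step_controls(39)` still enables the Homography Adjustment while `step_controls(40)` does not, matching the stated 40-of-50 schedule.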