Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Controllable Satellite-to-Street-View Synthesis with Precise Pose Alignment and Zero-Shot Environmental Control

Authors: Xianghui Ze, Zhenbo Song, Qiwei Wang, Jianfeng Lu, Yujiao Shi

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive quantitative and qualitative evaluations demonstrate that our approach significantly improves pose accuracy and enhances the diversity and realism of generated street-view images, setting a new benchmark for satellite-to-street-view generation tasks."
Researcher Affiliation | Academia | "1Nanjing University of Science and Technology, 2ShanghaiTech University"
Pseudocode | Yes | Algorithm 1: Iterative Homography Adjustment for Pose Refinement.
Input: diffusion steps T, the noisy image z_T, satellite image condition I_s, and the rotation R and translation T relative to the satellite image.
H ← I  // initialize the homography as a diagonal matrix of ones (identity)
for t = T down to 1 do
    z_t ← z_t ∘ grid(H)                         // warp the latent by the current homography
    z_{t,0}, z_{t-1} ← DDIM(z_t, t, R, T, I_s)  // eliminate noise to obtain z_{t,0}, z_{t-1}
    H ← H − ∇_H L_pose(z_{t,0}, R, T, I_s)      // adjust the homography matrix
    z_{t-1} ← z_{t-1} ∘ grid(H)                 // re-warp with the updated homography
end for
Open Source Code | No | "The paper does not provide any explicit statement about releasing its source code, nor does it include a link to a code repository."
Open Datasets | Yes | "Datasets. We adopt three cross-view datasets: CVUSA (Zhai et al. (2017)), KITTI (Geiger et al. (2013); Shi & Li (2022)), and VIGOR (Zhu et al. (2021); Lentsch et al. (2022)). These datasets comprise pairs of cross-view data, combining ground-level images with their corresponding satellite images."
Dataset Splits | Yes | "CVUSA comprises 35,532 pairs of satellite and street-view images for training and 8,884 pairs for testing. Following the setup of the cross-view localization task (Shi & Li (2022); Xia et al. (2023)), KITTI includes 19,655 pairs in the training data and 3,773 pairs in the testing data. VIGOR gathers data from New York, Seattle, San Francisco, and Chicago, dividing the data from each city into 52,609 pairs for the training set and 52,605 pairs for the test set."
Hardware Specification | No | The paper mentions "Training process on three GPUs" and a computational-cost table (Wo. IHA: 20126 MB, 5.406 s; W. IHA: 21022 MB, 5.513 s), but does not provide specific GPU models, CPU models, or other detailed hardware specifications used for running the experiments.
Software Dependencies | Yes | "Our model is finetuned based on the Stable Diffusion 1.5 model (Rombach et al. (2022a)), with the Cross-Attention of diffusion replaced by Geometric Cross-Attention, and satellite image conditions processed through a simple ViT network for feature extraction."
Experiment Setup | Yes | "Experimental setup. We take 256×256 satellite images as input to predict 128×512 ground images, following the same setup as in Shi et al. (2022a) for fair comparison. Our model is finetuned based on the Stable Diffusion 1.5 model (Rombach et al. (2022a)), with the Cross-Attention of diffusion replaced by Geometric Cross-Attention, and satellite image conditions processed through a simple ViT network for feature extraction. During inference, we employ DDIM sampling with 50 sampling steps, applying the Homography Adjustment scheme in the first 40 sampling steps and Zero-Shot Environmental Control throughout the entire sampling process. In Geometric Cross-Attention, we utilize 8 sampling heights of [-3, -2, -1, 1, 2, 3, 4, 5]. This constitutes an empirical setup. Training is run on three GPUs with a batch size of 32 for 200 epochs."
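The control flow of the Algorithm 1 quote above can be sketched as follows. This is a toy illustration, not the authors' implementation: `warp`, `ddim_step`, and `pose_loss_grad` are hypothetical stand-ins (the real method warps a latent image via grid sampling and backpropagates a pose-alignment loss), and the step size `lr` is an assumed parameter.

```python
import numpy as np

def warp(z, H):
    """Stub for the grid-sample warp of the latent by homography H.
    This toy uses a scalar latent, so the warp is a no-op kept only
    to mirror the two warp lines of Algorithm 1."""
    return z

def ddim_step(z_t, t, R, T, I_s):
    """Hypothetical stand-in for one DDIM step: returns the clean
    estimate z_{t,0} and the next latent z_{t-1}."""
    z0_hat = 0.9 * z_t + 0.1 * I_s    # toy denoiser pulls z toward the condition
    z_prev = 0.95 * z_t + 0.05 * I_s  # toy DDIM update
    return z0_hat, z_prev

def pose_loss_grad(H, z0_hat, R, T, I_s):
    """Hypothetical gradient of the pose loss w.r.t. H: as a toy
    surrogate, the loss minimum is a homography with a small
    horizontal shift, so the gradient is H minus that target."""
    H_target = np.eye(3)
    H_target[0, 2] = 0.2
    return H - H_target

def iterative_homography_adjustment(z_T, R, T, I_s, steps=50, lr=0.5):
    H = np.eye(3)  # "diagonal matrix of ones": identity initialization
    z_t = z_T
    for t in range(steps, 0, -1):
        z_t = warp(z_t, H)                                  # z_t <- z_t o grid(H)
        z0_hat, z_t = ddim_step(z_t, t, R, T, I_s)          # denoise
        H = H - lr * pose_loss_grad(H, z0_hat, R, T, I_s)   # adjust homography
        z_t = warp(z_t, H)                                  # re-warp with updated H
    return z_t, H
```

Run on the toy surrogate, the loop drives H from the identity toward the loss minimum while denoising proceeds, which is the interleaving the pseudocode describes.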
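The inference schedule stated in the setup row (50 DDIM steps, Homography Adjustment in the first 40 steps only, Zero-Shot Environmental Control throughout, 8 sampling heights) can be captured in a small helper. `step_controls` is a hypothetical function for illustration, not code from the paper; only the constants come from the quoted setup.

```python
SAMPLING_STEPS = 50                     # DDIM sampling steps reported in the paper
IHA_STEPS = 40                          # Homography Adjustment: first 40 steps only
HEIGHTS = [-3, -2, -1, 1, 2, 3, 4, 5]   # 8 sampling heights in Geometric Cross-Attention

def step_controls(i):
    """Return which controls are active at 0-indexed sampling step i
    (i = 0 is the first, noisiest DDIM step)."""
    if not 0 <= i < SAMPLING_STEPS:
        raise ValueError("step index out of range")
    return {
        "homography_adjustment": i < IHA_STEPS,
        "environmental_control": True,  # applied throughout sampling
    }
```

For example, `step_controls(39)` still enables the Homography Adjustment while `step_controls(40)` does not, matching the stated 40-of-50 schedule.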