Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning a Cross-Modal Schrödinger Bridge for Visual Domain Generalization

Authors: Hao Zheng, Jingjun Yi, Qi Bi, Huimin Huang, Haolan Zhan, Yawen Huang, Yuexiang Li, Xian Wu, Yefeng Zheng

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, SBGen achieves state-of-the-art performance on domain generalization in both classification and segmentation. This work highlights the importance of modeling domain shifts as structured stochastic processes grounded in semantic alignment.
Researcher Affiliation	Collaboration	1Tencent Jarvis Lab, China, 2Westlake University, China, 3University of Alberta, Canada, 4University of Amsterdam, the Netherlands, 5Monash University, Australia, 6University of Macau, Macau
Pseudocode	Yes	A pseudo-code implementation of the proposed SBGen is given in Algorithm 1.
Open Source Code	No	The datasets this paper uses are publicly available, and the source code is promised to be public once published.
Open Datasets	Yes	PACS [41], VLCS [24], OfficeHome [71], Terra Incognita [5], and DomainNet [58] comprise 9,991, 10,729, 15,588, 24,330 and 0.6 million images from four, four, four, four and six domains, respectively. In line with prior work [29, 12], the leave-one-domain-out evaluation protocol is adopted, where one domain is held out as the unseen target domain, while the remaining domains are used for training the model. Performance is reported using classification accuracy (percentage, %) as the evaluation metric. Four driving-scene semantic segmentation datasets that share 19 common scene categories are used for validation. Specifically, Cityscapes (C) [18] consists of 2,975 and 500 images for training and validation, respectively. The images were captured under clear conditions in tens of German cities. BDD-100K (B) [78] has 7,000 and 1,000 images for training and validation, respectively. The images were captured under diverse conditions in a variety of global cities. Mapillary (M) [47] is a large-scale semantic segmentation dataset, which consists of 25,000 images captured under diverse conditions. GTA5 (G) [61] is a synthetic dataset, which has 24,966 simulated images of American street landscapes.
Dataset Splits	Yes	In line with prior work [29, 12], the leave-one-domain-out evaluation protocol is adopted, where one domain is held out as the unseen target domain, while the remaining domains are used for training the model. Cityscapes (C) [18] consists of 2,975 and 500 images for training and validation, respectively. ... BDD-100K (B) [78] has 7,000 and 1,000 images for training and validation, respectively. ... Following the evaluation protocol of existing foundation model based DGSS methods [74, 52], two commonly used evaluation settings are: 1) G → C, B, M; and 2) C → B, M.
Hardware Specification	Yes	The GPU hour refers to one single A100 GPU hardware.
Software Dependencies	No	Following prior work [52], the same training configuration is set for all types of pre-trained foundation models (e.g., CLIP, DINOv2, and EVA02), and for both domain generalization in classification and semantic segmentation. The task-specific decoder D integrates the pixel decoder of the Mask2Former model [14].
Experiment Setup	Yes	In all the experiments, the images are cropped and resized to 512 × 512 pixels. The batch size is set to 16, with an AdamW optimizer. The initial learning rate is set to 1 × 10⁻⁵ for all the synthetic-to-real settings, and to 1 × 10⁻⁴ for all the real-to-real settings. The learning rate of the backbone is further scaled by 0.1. The training terminates after 20,000 iterations. Following [52], a linear warm-up is applied over the first 1,500 iterations, followed by a linear decay. Some common data augmentation techniques, namely, random scaling, random cropping, random flipping, color jittering, and rare class sampling, are also used. Table 6: Impact of time step T. By default, T is set to 5 in all of our experiments. Table 9: Impact of hyper-parameter λ. ... when λ is set to 1, the generalization performance is optimal. Table 10: Impact of hyper-parameter K. By default, K is set to 0.3 in all of our experiments.
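
The learning-rate schedule quoted in the Experiment Setup row can be sketched roughly as below. This is an illustration only, not the authors' code: the function name lr_at and its parameter names are hypothetical, and the sketch assumes that warm-up spans the first 1,500 iterations, that it ramps linearly from base_lr / 1500 up to base_lr, and that the rate then decays linearly toward zero by iteration 20,000. The source states only the base rates, the 0.1 backbone scale, the warm-up length, and the total number of iterations.

```python
def lr_at(step, base_lr=1e-4, warmup_iters=1500,
          total_iters=20000, backbone_scale=0.1):
    """Return (head_lr, backbone_lr) for a 0-indexed training step.

    Assumed shape (not confirmed by the source): linear warm-up over the
    first `warmup_iters` steps, then linear decay toward zero at
    `total_iters`.
    """
    if not 0 <= step < total_iters:
        raise ValueError("step must lie in [0, total_iters)")

    if step < warmup_iters:
        # Linear ramp: reaches a factor of 1.0 on the last warm-up step.
        factor = (step + 1) / warmup_iters
    else:
        # Linear decay: factor is 1.0 at step == warmup_iters and
        # approaches 0 as step approaches total_iters.
        factor = (total_iters - step) / (total_iters - warmup_iters)

    head_lr = base_lr * factor
    # The backbone uses the same schedule, scaled by 0.1 per the source.
    return head_lr, head_lr * backbone_scale


if __name__ == "__main__":
    # Real-to-real setting (base rate 1e-4); pass base_lr=1e-5 for the
    # synthetic-to-real setting.
    for s in (0, 1499, 1500, 10000, 19999):
        head, backbone = lr_at(s)
        print(f"step {s:>5}: head={head:.3e} backbone={backbone:.3e}")
```

In a training loop, a function like this would typically be wrapped in a scheduler (for example PyTorch's LambdaLR) with separate parameter groups for the backbone and the decoder; that wiring is omitted here because the source does not describe it.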