Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CroPe: Cross-Modal Semantic Compensation Adaptation for All Adverse Scene Understanding
Authors: Qin Xu, Qihang Wu, Hongtao Lu, Xiaoxia Cheng, Bo Jiang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that our method achieves state-of-the-art performance in various adverse scenarios, including rain, snow, fog, and nighttime, while also reducing training cost. This highlights the model's superiority in both effectiveness and efficiency. In this section, we first provide a detailed description of the experimental settings, including the datasets and implementation details, in Section 4.1. Subsequently, we present the main experimental results of the model in Section 4.2. Furthermore, in Section 4.3, we conduct comprehensive ablation studies to further validate the effectiveness of CroPe. |
| Researcher Affiliation | Academia | Qin Xu1,2, Qihang Wu1,2, Hongtao Lu1,2, Xiaoxia Cheng1,3 , Bo Jiang1,2 1School of Computer Science & Technology, Anhui University 2Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University 3College of Computer Science & Technology, Zhejiang University EMAIL, EMAIL EMAIL, EMAIL |
| Pseudocode | Yes | A.2 Network Architecture Details ... The complete training process of our CroPe is illustrated in Algorithm 1. Algorithm 1 The core algorithm in CroPe |
| Open Source Code | Yes | Project website: https://github.com/wqh011128/CroPe |
| Open Datasets | Yes | Datasets: To demonstrate the effectiveness of our proposed CroPe method, we conduct experiments across all adverse scenes in seven real-world datasets, including Cityscapes (CS) [23], ACDC [24], Dark Zurich (DZ) [25], Nighttime Driving (ND) [26], BDD100K-Night (BD) [27], Foggy Zurich (FZ) [28] and Foggy Driving (FD) [29]. Detailed dataset information, including adverse scene types, data splits, and statistics, can be found in Appendix A.3. |
| Dataset Splits | Yes | Datasets: ... Cityscapes (CS)[23], containing 2,975 training images, 500 validation images, and 1,525 test images. ACDC contains four adverse scenes: fog, rain, snow, and night. For each scene, there are 400 training images, 100 validation images (including 106 nighttime images), and 500 test images. ... Dark Zurich (DZ) provides 8,779 images captured during nighttime, twilight, and daytime, with 50 validation and 151 test images. Nighttime Driving (ND) includes 50 coarsely annotated nighttime images specifically designed for testing. BDD100K-Night (BD), a subset of the BDD100K segmentation dataset, consists of 87 finely annotated nighttime images. Foggy Zurich (FZ) contains 3,808 images with light and medium fog, and 40 images for testing. Foggy Driving (FD) provides 101 annotated images purely for testing. For more structured statistics, see Table 9. |
| Hardware Specification | Yes | We conduct all experiments on a single RTX4090. |
| Software Dependencies | No | The paper mentions using CLIP (-B/16 and -L/14 [30]) as the backbone and the AdamW optimizer, but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA) required for reproduction. |
| Experiment Setup | Yes | Implementation Details: Following the prevailing method DAFormer, we adopt CLIP (-B/16 and -L/14 [30]) as the backbone. During training, we use a resolution of 512 × 512, rather than the high resolution of 1024 × 1024 employed by SOTA methods, and omit the FD loss typically used. The initial learning rate for the AdamW optimizer is set to 6e-5, and the learning rates for the encoder, RCTVF module, and segmentation head are 6e-5 scaled by 1 10 , 10 , 10 , respectively. Additionally, the context length of the text prompt M is fixed to 5. The attention layers N are set to 6, with two layers computed at each scale. The weight parameter λ is set to 2.0. We conduct training experiments for 40,000 iterations. |
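
For anyone attempting a reproduction, the hyperparameters quoted in the Experiment Setup and Hardware Specification rows can be collected into a single configuration object. The following is a minimal sketch only: the class name `CroPeSetup`, its field names, and the helper `group_learning_rates` are hypothetical and do not come from the paper or its repository. The per-module learning-rate multipliers are not legible in the extracted text, so they are left as caller-supplied inputs, and the values passed in the usage lines are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CroPeSetup:
    """Values quoted in the table above; field names are illustrative."""
    backbones: tuple = ("CLIP-B/16", "CLIP-L/14")
    train_resolution: tuple = (512, 512)
    optimizer: str = "AdamW"
    base_lr: float = 6e-5
    prompt_context_length: int = 5   # M in the quoted text
    attention_layers: int = 6        # N in the quoted text
    layers_per_scale: int = 2        # "two layers computed at each scale"
    loss_weight: float = 2.0         # lambda in the quoted text
    iterations: int = 40_000
    gpu: str = "RTX 4090"


def group_learning_rates(base_lr, multipliers):
    """Scale the base learning rate for each named parameter group.

    The multipliers must be supplied by the caller: the extracted
    table text does not preserve them reliably.
    """
    return {name: base_lr * factor for name, factor in multipliers.items()}


setup = CroPeSetup()

# Inferred, not stated in the source: 6 layers at 2 per scale implies 3 scales.
num_scales = setup.attention_layers // setup.layers_per_scale

# Placeholder multipliers (assumption); replace with the paper's actual values.
placeholder_multipliers = {"encoder": 1.0, "rctvf": 1.0, "seg_head": 1.0}
lrs = group_learning_rates(setup.base_lr, placeholder_multipliers)
```

Keeping the multipliers outside the frozen dataclass makes it explicit which values come from the quoted text and which still need to be confirmed against the paper.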