Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Authors: Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Bo Li, Yang Tang, Pan Zhou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
Researcher Affiliation | Collaboration | Ruihao Xia (1,2), Yu Liang (2), Peng-Tao Jiang (2), Hao Zhang (2), Bo Li (2), Yang Tang (1,3), Pan Zhou (4); affiliations: 1 East China University of Science and Technology, 2 vivo Mobile Communication Co., Ltd., 3 Peng Cheng Laboratory, 4 Singapore Management University
Pseudocode | No | The paper provides a framework diagram (Figure 2) but does not include a formal pseudocode or algorithm block.
Open Source Code | Yes | We open-source our code and models at https://github.com/Xia-Rho/MADM.
Open Datasets | Yes | In our experiments, we adopt the Cityscapes-Image [13] dataset as the source modality and the DELIVER-Depth [5], FMB-Infrared [6], and DSEC-Event [7] datasets as the target modalities.
Dataset Splits | Yes | Cityscapes [13] is the source dataset in our experiments... split into 2,975 training images and 500 validation images... DELIVER [5]... contains 3,983/2,005/1,897 samples for training/validation/testing...
Hardware Specification | Yes | Experiments are conducted on an NVIDIA H800 GPU, occupying about 57 GB of memory.
Software Dependencies | No | The paper mentions using the Stable Diffusion v1-4 model and DAFormer components but does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA (a hedged loading sketch follows the table).
Experiment Setup | Yes | We train our MADM for 10k iterations with a batch size of 2 and an image resolution of 512 × 512. The optimization is instantiated with AdamW [45] with a learning rate of 5e-6. For the hyperparameters β, γ, and λ_reg in DPLG and LPLR, we set them to {5000, 60, 1.0} / {8000, 50, 1.0} / {8000, 50, 10.0} for the depth/infrared/event modalities, respectively (see the configuration sketch below).
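
Since the Software Dependencies row names Stable Diffusion v1-4 without pinning any library versions, the following is a minimal sketch of loading that checkpoint with Hugging Face diffusers. The choice of diffusers is an assumption for illustration; the authors' repository may load the weights differently.

```python
# Minimal sketch: loading Stable Diffusion v1-4 with Hugging Face diffusers.
# Assumption: the paper names the checkpoint but not the loading library,
# so diffusers/torch here are illustrative, not the authors' exact setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # checkpoint named in the paper
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # the paper reports a single NVIDIA H800 GPU

# For adaptation-style fine-tuning, the UNet and VAE are the pipeline
# components typically reused:
unet, vae = pipe.unet, pipe.vae
```

Pinning exact versions of torch, diffusers, and CUDA in a requirements file would close the reproducibility gap flagged in this row.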
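The Experiment Setup row maps directly onto a short configuration sketch. The snippet below assumes PyTorch; the model is a stand-in placeholder, since MADM's architecture itself is not reproduced here.

```python
# Hedged sketch of the reported optimization setup (PyTorch assumed; the
# paper does not state framework versions).
import torch
from torch import nn
from torch.optim import AdamW

# Per-modality {beta, gamma, lambda_reg} for DPLG and LPLR, as reported.
HPARAMS = {
    "depth":    dict(beta=5000, gamma=60, lambda_reg=1.0),
    "infrared": dict(beta=8000, gamma=50, lambda_reg=1.0),
    "event":    dict(beta=8000, gamma=50, lambda_reg=10.0),
}

NUM_ITERS = 10_000       # training iterations
BATCH_SIZE = 2           # images per step
RESOLUTION = (512, 512)  # input resolution

model = nn.Conv2d(3, 19, 1)  # placeholder module, not the MADM network
optimizer = AdamW(model.parameters(), lr=5e-6)
```

Having these values stated explicitly is what earns the Yes in this row; a reimplementation still has to supply the model, the data pipeline, and the DPLG/LPLR losses.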