Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
Authors: Bo-Wen Yin, Jiao-Long Cao, Xuying Zhang, Yuming Chen, Ming-Ming Cheng, Qibin Hou
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of our OmniSegmentor, we conduct extensive experiments on six popular multi-modal segmentation datasets, including NYU Depthv2 [48], SUNRGBD [49], MFNet [24], KITTI-360 [34], EventScape [18], and DeLiVER [72]. The experiments are conducted on NVIDIA A40 GPUs. The models are optimized using the cross-entropy loss function and the AdamW [30] method, where the learning rate is initialized to 6e-5 and scheduled by the poly strategy. The images are augmented by random resize with a ratio of 0.5 to 1.75, random horizontal flipping, and random crop. More details, e.g., pretraining settings, are in supplementary materials. Following DFormer [66], we adopt the light decoder head [19] by default. More experimental details are in the supplementary materials. |
| Researcher Affiliation | Academia | ¹NKIARI, Shenzhen Futian; ²VCIP, College of Computer Science, Nankai University. Corresponding author: EMAIL |
| Pseudocode | No | The paper describes its methods and architectures through descriptive text and figures (e.g., Figure 3, Figure 5), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data, model checkpoints, and source code will be made publicly available: https://github.com/VCIP-RGBD/DFormer. |
| Open Datasets | Yes | Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt, which contains five popular visual modalities; ... This dataset is built upon ImageNet [44] and supplements each RGB image with four additional visual modalities, i.e., depth, thermal, LiDAR, and event. ... Extensive experiment results demonstrate the effectiveness of OmniSegmentor on the benchmarks of a wide range of multi-modal semantic segmentation tasks, including NYU Depthv2 [48], EventScape [18], MFNet [24], DeLiVER [72], SUNRGBD [49], and KITTI-360 [34]. |
| Dataset Splits | Yes | NYU Depthv2 (RGB-D) [48] contains 1,449 RGB-D images with a size of 640 × 480, which is divided into 795 training and 654 test images with annotations for 40 categories. ... MFNet (RGB-T) [24] is a multi-spectral RGB-T image dataset, which has 1,569 images. 784/392/393 samples are used for training/validation/test, respectively, annotated in 8 classes at the resolution of 640 × 480. ... DeLiVER [72] is a large-scale multimodal segmentation dataset, which is also generated by the CARLA simulator. This dataset contains 7,885 front-view samples divided into 3,983 / 2,005 / 1,897 for training / validation / test, respectively. |
| Hardware Specification | Yes | The experiments are conducted on NVIDIA A40 GPUs. ... The inference time, i.e., frames per second (FPS), is calculated on a single NVIDIA A40 GPU. |
| Software Dependencies | No | The paper mentions optimizers like AdamW [30] and frameworks like DFormer [66], but does not provide specific version numbers for any software libraries or programming languages used. |
| Experiment Setup | Yes | The models are optimized using the cross-entropy loss function and the AdamW [30] method, where the learning rate is initialized to 6e-5 and scheduled by the poly strategy. The images are augmented by random resize with a ratio of 0.5 to 1.75, random horizontal flipping, and random crop. ... Pretraining details: The multi-modal features from the last stage are flattened along the spatial dimension and fed into the linear projection to obtain the category probabilities, which are used to calculate the classification loss, i.e., the standard cross-entropy loss. To verify the universality and robustness of our methods on different architectures, we adopt DFormer-L [66], MiT-B2 [62], and ResNet-101 [26] as backbone and perform ImageNeXt pretraining with them. In the experiments, unless otherwise specified, the OmniSegmentor uses the DFormer-L backbone. ImageNeXt pretraining adopts the same hyperparameters as DFormer-L. Following the commonly used pretraining durations [38, 62, 23, 66], OmniSegmentor is pretrained for 300 epochs. We use AdamW [30] with learning rate 1e-3 and weight decay 5e-2 as our optimizer, and the batch size is set to 1024. |
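
The fine-tuning setup quoted above names a "poly" learning-rate schedule starting at 6e-5, and the Dataset Splits row quotes per-split counts. The sketch below illustrates one common form of the poly schedule and checks that the quoted split sizes sum to the quoted totals. It is an illustration rather than the authors' code. The decay exponent `power=0.9` and the step count `total_steps=1000` are assumptions (the quoted text gives neither), and the names `poly_lr` and `splits_consistent` are hypothetical.

```python
# Minimal sketch, not the authors' implementation.
# Assumptions: power=0.9 and any total_steps value are illustrative only;
# the quoted text states just the initial rate (6e-5) and the "poly" strategy.

def poly_lr(step: int, total_steps: int, base_lr: float = 6e-5, power: float = 0.9) -> float:
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so the rate is 0 at and beyond the final step.
    progress = min(max(step, 0), total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power


# Split sizes exactly as quoted in the Dataset Splits row: (total, per-split counts).
QUOTED_SPLITS = {
    "NYU Depthv2": (1449, (795, 654)),
    "MFNet": (1569, (784, 392, 393)),
    "DeLiVER": (7885, (3983, 2005, 1897)),
}


def splits_consistent(splits: dict) -> dict:
    """Map each dataset name to True when its split counts sum to the total."""
    return {name: sum(parts) == total for name, (total, parts) in splits.items()}


if __name__ == "__main__":
    total_steps = 1000  # hypothetical schedule length
    for step in (0, 250, 500, 750, 1000):
        print(step, poly_lr(step, total_steps))
    print(splits_consistent(QUOTED_SPLITS))
```

With a different exponent the curve changes shape but keeps the same endpoints (the base rate at step 0 and zero at the final step), so the unstated exponent does not affect the quoted initial value.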