Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals

Authors: Oliver Hahn, Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess the efficacy of our approach, we compare it to the current state-of-the-art in unsupervised semantic segmentation. For a fair comparison, we closely follow the overall setup used by numerous previous works (Ji et al., 2019; Cho et al., 2021; Hamilton et al., 2022; Seong et al., 2023). Overall, we observe that the DINO baseline already achieves strong results (cf. Tabs. 1 to 3). While DINOv2 features significantly raise the supervised upper bounds in terms of Acc and mIoU, the improvement in the unsupervised case remains more modest. Nevertheless, PriMaPs-EM further boosts the unsupervised segmentation performance. In Tab. 1, we compare to previous work on the Cityscapes dataset. PriMaPs-EM leads to a consistent improvement over all baselines in terms of unsupervised segmentation accuracy. For example, PriMaPs-EM boosts DINO ViT-S/8 by +3.6% and +19.8% in terms of mIoU and Acc, respectively, which leads to state-of-the-art performance. Notably, we find PriMaPs-EM to be complementary to other state-of-the-art unsupervised segmentation methods like STEGO (Hamilton et al., 2022) and HP (Seong et al., 2023) on the corresponding backbone model. This suggests that these methods use their SSL representation only to a limited extent and do not fully leverage the inherent properties of the underlying SSL embeddings. Similar observations can be drawn for the experiments on COCO-Stuff in Tab. 2. PriMaPs-EM leads to a consistent improvement across all four SSL baselines, as well as an improvement over STEGO and HP. For instance, combining STEGO with PriMaPs-EM leads to +14.0% and +19.1% improvement over the baseline in terms of mIoU and Acc for DINO ViT-B/8. Experiments on the Potsdam-3 dataset follow the same pattern (cf. Tab. 3). PriMaPs-EM leads to a consistent gain over the baseline, e.g. +17.6% and +14.4% in terms of mIoU and Acc, respectively, for DINO ViT-B/8.
Researcher Affiliation | Academia | Oliver Hahn 1, Nikita Araslanov 2,3, Simone Schaub-Meyer 1,4, Stefan Roth 1,4; 1 Department of Computer Science, TU Darmstadt; 2 Department of Computer Science, TU Munich; 3 Munich Center for Machine Learning (MCML); 4 hessian.AI
Pseudocode | No | The paper describes methods and processes through textual descriptions and figures (e.g., Fig. 2 "PriMaPs process", Fig. 3 "PriMaPs-EM architecture" and "PriMaPs pseudo-label generation"), but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Both the code and models are publicly available at https://github.com/visinf/primaps.
Open Datasets | Yes | Datasets. Following the practice of previous work, we conduct experiments on Cityscapes (Cordts et al., 2016), COCO-Stuff (Caesar et al., 2018), and Potsdam-3 (ISPRS). Cityscapes and COCO-Stuff are evaluated using 27 classes, while Potsdam is evaluated on the 3-class variant. Adopting the established evaluation protocol (Ji et al., 2019; Cho et al., 2021; Hamilton et al., 2022; Seong et al., 2023), we resize images to 320 pixels along the smaller axis and crop the center 320×320 pixels. This is adjusted to 322 pixels for DINOv2. Different from previous work, we apply this simple scheme throughout this work, thus dispensing with the elaborate multi-crop approaches of previous methods (Hamilton et al., 2022; Yin et al., 2022; Seong et al., 2023).
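The resize-and-center-crop protocol quoted above can be sketched as a small helper. `resize_crop_params` is illustrative and not from the paper's code; it computes the resize dimensions and crop box, scaling the smaller image side to `target` pixels (320, or 322 for DINOv2) before taking the central square crop.

```python
def resize_crop_params(width, height, target=320):
    """Scale so the smaller image side equals `target` pixels,
    then take the central target x target crop (322 for DINOv2)."""
    scale = target / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left, top = (new_w - target) // 2, (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

# A 2048x1024 Cityscapes frame is resized to 640x320,
# then the middle 320x320 patch is cropped.
dims, box = resize_crop_params(2048, 1024)
```

The same box could be passed directly to, e.g., `PIL.Image.crop` after resizing.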
Dataset Splits | Yes | Cityscapes (Cordts et al., 2016) is an ego-centric street-scene dataset containing 5000 high-resolution images with 2048×1024 pixels. It is split into 2975 train, 500 val, and 1525 test images. Following previous work (Ji et al., 2019; Cho et al., 2021; Yin et al., 2022; Hamilton et al., 2022; Seong et al., 2023), evaluation is conducted on the 27-class setup using the val split. COCO-Stuff (Caesar et al., 2018) is a dataset of everyday life scenes containing 80 things and 91 stuff classes. Following previous work (Ji et al., 2019; Cho et al., 2021; Hamilton et al., 2022; Yin et al., 2022; Li et al., 2023; Seong et al., 2023), we use a reduced variant by Ji et al. (2019) containing 49629 train and 2175 test images. Potsdam-3 (ISPRS) is a remote sensing dataset consisting of 8550 RGBIR satellite images with 200×200 pixels, which is split into 4545 train and 855 test images, as well as 3150 additional unlabeled images. In our experiments, the 3-label variant of Potsdam is evaluated and the additional unlabeled images are not used.
Hardware Specification | Yes | We perform all experiments on a single NVIDIA A6000 GPU.
Software Dependencies | No | Our work is implemented in PyTorch (Paszke et al., 2019). We build on the code of Ji et al. (2019), Van Gansbeke et al. (2021), and Hamilton et al. (2022). We initialize the class prototypes θM with the first K principal components; we use 2975 images for PCA, as this is the largest number of training images shared by all datasets. Next, θM is pre-trained by minimizing Eq. (9) using Adam (Kingma & Ba, 2015). For fitting the running class prototypes using EM, θR is optimized by minimizing the focal loss from Eq. (10) with Adam (Kingma & Ba, 2015).
Experiment Setup | Yes | We initialize the class prototypes θM with the first K principal components; we use 2975 images for PCA, as this is the largest number of training images shared by all datasets. Next, θM is pre-trained by minimizing Eq. (9) using Adam (Kingma & Ba, 2015). We use a learning rate of 0.005 for 2 epochs on all datasets and backbones. The weights are then copied to θR. For fitting the running class prototypes using EM, θR is optimized by minimizing the focal loss from Eq. (10) with Adam (Kingma & Ba, 2015) using a learning rate of 0.005. The momentum class prototypes θM are updated using an EMA according to Eq. (11) every γs = 10 steps with decay γψ = 0.98. We set the PriMaPs mask-proposal threshold to ψ = 0.4 and provide detailed ablation experiments in Appendix A.2. We use a batch size of 32 for 50 epochs on Cityscapes and Potsdam-3, and use 5 epochs on COCO-Stuff due to its larger size.
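The EMA schedule for the momentum prototypes described above can be illustrated with a minimal sketch, assuming θM and θR are flat parameter vectors (plain Python lists here for simplicity; `ema_update`, `ema_steps`, and the loop structure are illustrative assumptions, not the authors' implementation):

```python
def ema_update(theta_m, theta_r, decay=0.98):
    # Exponential moving average with decay gamma_psi = 0.98:
    # theta_M <- decay * theta_M + (1 - decay) * theta_R
    return [decay * m + (1 - decay) * r for m, r in zip(theta_m, theta_r)]

ema_steps = 10          # gamma_s: refresh momentum prototypes every 10 steps
theta_m = [1.0, 0.0]    # momentum class prototypes (toy 2-d example)
theta_r = [0.0, 1.0]    # running class prototypes

for step in range(1, 21):
    # ... here theta_r would be optimized against the focal loss (Eq. 10) ...
    if step % ema_steps == 0:
        theta_m = ema_update(theta_m, theta_r)
```

After 20 steps the momentum prototypes have received two EMA updates, so each entry has moved a fraction 1 - 0.98² toward the running prototypes; the slow decay keeps θM a stable target while θR adapts quickly.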