Mix-and-Match Tuning for Self-Supervised Semantic Segmentation
Authors: Xiaohang Zhan, Ziwei Liu, Ping Luo, Xiaoou Tang, Chen Change Loy
AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves state-of-the-art performance on semantic segmentation, outperforming (Larsson, Maire, and Shakhnarovich 2016) by 14.3% and (Larsson, Maire, and Shakhnarovich 2017) by 8.5% when using VGG-16 as the backbone network. Notably, our M&M self-supervision paradigm shows comparable results (a 0.3-point advantage) to its ImageNet pre-trained counterpart. Furthermore, on the PASCAL VOC2012 test set, our approach achieves 64.3% mIoU, which is a record-breaking performance for self-supervision methods. Qualitative results of this model are shown in Fig. 6. We additionally perform an ablation study on the AlexNet setting. As shown in Table 1, with the colorization task as pre-training, our class-wise connected graph outperforms random triplets by 2.5%, suggesting the importance of the class-wise connected graph. With random initialization, our model surprisingly performs even better than with colorization pre-training. |
| Researcher Affiliation | Academia | Xiaohang Zhan, Ziwei Liu, Ping Luo, Xiaoou Tang, Chen Change Loy Department of Information Engineering, The Chinese University of Hong Kong {zx017, lz013, pluo, xtang, ccloy}@ie.cuhk.edu.hk |
| Pseudocode | No | The paper describes the steps of Mix-and-Match tuning and graph construction in detail but does not provide formal pseudocode or algorithm blocks (a hedged sketch reconstructed from the described steps follows this table). |
| Open Source Code | Yes | Project page: http://mmlab.ie.cuhk.edu.hk/projects/M&M/ |
| Open Datasets | Yes | In M&M tuning, we make use of the PASCAL VOC2012 dataset (Everingham et al. 2010), which consists of 10,582 training samples with pixel-wise annotations. The same dataset is used in (Noroozi and Favaro 2016; Larsson, Maire, and Shakhnarovich 2017) for fine-tuning, so no additional data is used in M&M. For fair comparisons, all self-supervision methods are benchmarked on the PASCAL VOC2012 validation set that comes with 1,449 images. We further apply our method on the Cityscapes dataset (Cordts et al. 2016), with 2,974 training samples, and report results on the 500 validation samples. |
| Dataset Splits | Yes | In M&M tuning, we make use of the PASCAL VOC2012 dataset (Everingham et al. 2010), which consists of 10,582 training samples with pixel-wise annotations. [...] All self-supervision methods are benchmarked on the PASCAL VOC2012 validation set that comes with 1,449 images. [...] We further apply our method on the Cityscapes dataset (Cordts et al. 2016), with 2,974 training samples and report results on the 500 validation samples. |
| Hardware Specification | Yes | It costs respectively 3.5 hours and 5.8 hours on a single TITAN-X for AlexNet and VGG-16, which are much faster than conventional ImageNet pre-training or any other self-supervised pre-training task. |
| Software Dependencies | No | The paper mentions using AlexNet and VGG-16 architectures, but it does not specify version numbers for any software, libraries, or frameworks used (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | From a batch of 16 images in each CNN iteration, we sample 10 patches per image with various sizes and resize them to a fixed size of 128×128. Then we extract the pool5 features of these patches from the CNN for later use. We assign each patch a unique label, namely the label of its central pixel in the corresponding label map. Then we perform the iterative strategy to construct the graph as discussed in the methodology section. We use each node in the graph as an anchor, which is made possible by our graph construction strategy. If a node's label is unique among all the nodes, we duplicate it as its positive counterpart. In this way, we obtain a batch of meaningful triplets whose number equals the number of nodes, and feed them into a triplet loss layer whose margin α is set to 2.1. Such M&M tuning is conducted for 8000 iterations on the PASCAL VOC2012 or Cityscapes training dataset. The learning rate is fixed at 0.01 before iteration 6000 and then dropped to 0.001. We apply batch normalization to speed up convergence. (Hedged code sketches of this setup follow below.) |
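
Since the paper provides no formal pseudocode, the sketch below reconstructs only the triplet-construction step as quoted in the Experiment Setup row. All names (`build_triplets`, `features`, `labels`) are hypothetical, and the random same-class/different-class pairing is a simplification: the paper's iterative class-wise connected-graph construction is not fully specified in this report, so read this as an illustration rather than the authors' algorithm.

```python
import random
from collections import defaultdict

def build_triplets(features, labels):
    """Form one (anchor, positive, negative) triplet per patch.

    features: per-patch pool5 feature vectors.
    labels:   central-pixel class label of each patch.
    """
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    triplets = []
    for a, lab in enumerate(labels):
        same = [i for i in by_class[lab] if i != a]
        # Per the quoted setup: a node whose label is unique among all
        # nodes is duplicated as its own positive counterpart.
        p = random.choice(same) if same else a
        diff = [i for i, l in enumerate(labels) if l != lab]
        if not diff:
            continue  # degenerate batch: every patch shares one class
        n = random.choice(diff)
        triplets.append((features[a], features[p], features[n]))
    return triplets
```

Barring the degenerate single-class case, this yields one triplet per node, matching the quoted claim that the number of triplets equals the number of nodes.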
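
The margin and schedule in the same row map onto standard PyTorch components, as in the minimal sketch below. Only the margin of 2.1, the 8000 iterations, and the learning-rate drop from 0.01 to 0.001 at iteration 6000 come from the quoted setup; the plain-SGD optimizer, the placeholder embedding head, and the random tensors standing in for pool5 features are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Margin alpha = 2.1, as quoted from the paper's setup.
triplet_loss = nn.TripletMarginLoss(margin=2.1)

# Placeholder embedding head over flattened pool5 features (9216 dims
# for AlexNet); the real model and the optimizer choice are assumptions.
model = nn.Linear(9216, 128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for it in range(8000):      # 8000 M&M tuning iterations
    if it == 6000:          # LR fixed at 0.01 until 6000, then 0.001
        for group in optimizer.param_groups:
            group["lr"] = 0.001
    # Random stand-ins for the (anchor, positive, negative) features
    # that build_triplets() above would produce from real patches.
    anchor, positive, negative = (torch.randn(16, 9216) for _ in range(3))
    loss = triplet_loss(model(anchor), model(positive), model(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```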