Mind Your Augmentation: The Key to Decoupling Dense Self-Supervised Learning

Authors: Congpei Qiu, Tong Zhang, Yanhao Wu, Wei Ke, Mathieu Salzmann, Sabine Süsstrunk

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments, incorporating our solution into two CNN-based and two ViT-based methods, with results confirming the effectiveness of our approach. Moreover, we provide empirical evidence that our method significantly contributes to the disentanglement of feature representations among objects, both in quantitative and qualitative terms.
Researcher Affiliation | Academia | 1 School of Software Engineering, Xi'an Jiaotong University, China; 2 School of Computer and Communication Sciences, EPFL, Switzerland
Pseudocode | Yes | Appendix A ALGORITHM: A.1 ORIGINAL CUTOUT FOR DENSE SSL (Algorithm 1: Proposal-level Cutout, DeVries & Taylor, 2017); A.2 REGION COLLABORATIVE CUTOUT (Algorithm 2: Region Collaborative Cutout). [An illustrative sketch of both cutout masks follows this table.]
Open Source Code | Yes | Code is available at https://github.com/ztt1024/denseSSL
Open Datasets | Yes | In the pre-training stage, we sample only 25% images from a batch to construct the de-coupling branch. As we target SSL on the multi-object datasets, following the protocol of (Bai et al., 2022; Wang et al., 2021), we pre-train each model on COCO train2017 for 800 epochs.
Dataset Splits | Yes | For COCO detection and instance segmentation, we fine-tune a Mask R-CNN detector (C4-backbone) on COCO train2017 with 1× schedule. The evaluation is performed on the COCO val2017 split. Similarly to image-level k-NN, we predict the label for each object in the evaluation set by finding the k-nearest object-level features in the training set. Tab. 3 shows the O-KNN and OKNND accuracy on COCO using train2017 for feature extraction and val2017 for evaluation. [A toy object-level k-NN voting sketch follows this table.]
Hardware Specification | Yes | Each model is pre-trained on COCO for 800 epochs with an 8-GPU 3090 machine.
Software Dependencies | No | The paper mentions various architectures and frameworks (e.g., ResNet-50, ViT-S/16, Faster R-CNN, Mask R-CNN, UperNet) and optimizers (e.g., SGD, LARS, AdamW) but does not specify version numbers for these software components or the programming language used.
Experiment Setup | Yes | To demonstrate the effectiveness and generalization capability of our decoupling strategy, we apply our module on DenseCL (Wang et al., 2021), SoCo (Wei et al., 2021), Leopart (Ziegler & Asano, 2022), iBOT (Zhou et al., 2022) and MaskAlign (Xue et al., 2023)... For the backbone, we employ ResNet-50 (He et al., 2016) on CNN-based methods and ViT-S/16 (Dosovitskiy et al., 2021) on ViT-based methods... we pre-train each model on COCO train2017 for 800 epochs. For a fair comparison, we adopt the same hyper-parameters for every method with or without the de-coupling strategy. For region generation and the RCC mask, we divide the input view into 3x3 grids and create a single bounding box in each grid with the scale in (0.15, 4) and aspect ratio in the range (0.5, 2). The RCC cutout ratio is selected from the range [0.3, 0.5]. The de-coupled views are generated with the augmented key views... We set the de-coupling weight λ_DC to 0.3. For DenseCL, we follow the same setting to adopt an SGD optimizer with the base learning rate lr_base of 0.3. For SoCo, we adopt LARS (You et al., 2017) with lr_base as 2.0 and a batch size of 1024. For iBOT, we utilize AdamW (Loshchilov & Hutter, 2017) and set lr_base to 1e-3 and the batch size to 512. [A sketch of the grid-based region generation appears after this table.]
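
The pseudocode row above refers to the paper's Appendix A algorithms (Proposal-level Cutout and Region Collaborative Cutout). The following is a minimal, illustrative Python/PyTorch sketch of the two kinds of masks, not the paper's exact algorithms: the binary H×W mask representation, the single square hole per box or grid cell, and the hole-placement policy are our assumptions.

import torch

def cutout_mask(h, w, box, ratio):
    """One square cutout hole covering roughly `ratio` of the box area
    (DeVries & Taylor, 2017 style, applied at the proposal level)."""
    mask = torch.ones(h, w)
    x0, y0, x1, y1 = box
    bh, bw = y1 - y0, x1 - x0
    side = int((ratio * bh * bw) ** 0.5)
    cy = torch.randint(y0, max(y0 + 1, y1 - side), (1,)).item()
    cx = torch.randint(x0, max(x0 + 1, x1 - side), (1,)).item()
    mask[cy:cy + side, cx:cx + side] = 0
    return mask

def region_collaborative_cutout(h, w, grid=3, ratio_range=(0.3, 0.5)):
    """Sketch of the RCC idea: place one cutout hole inside every cell of a
    grid x grid partition, so each region is occluded rather than only one
    random location in the whole view."""
    mask = torch.ones(h, w)
    for gy in range(grid):
        for gx in range(grid):
            cell = (gx * w // grid, gy * h // grid,
                    (gx + 1) * w // grid, (gy + 1) * h // grid)
            ratio = torch.empty(1).uniform_(*ratio_range).item()
            mask *= cutout_mask(h, w, cell, ratio)  # accumulate holes per cell
    return mask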
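
The experiment-setup row describes grid-based region generation: one box per cell of a 3x3 grid, with area scale in (0.15, 4) relative to the cell and aspect ratio in (0.5, 2). The sketch below illustrates one plausible way to sample such boxes; centring each box on its cell and clamping to the image bounds are our assumptions, not details given in the paper.

import math, random

def sample_grid_boxes(h, w, grid=3, scale=(0.15, 4.0), ratio=(0.5, 2.0)):
    """Return one (x0, y0, x1, y1) box per grid cell, sampled with the
    scale and aspect-ratio ranges quoted in the setup."""
    boxes = []
    cell_h, cell_w = h / grid, w / grid
    for gy in range(grid):
        for gx in range(grid):
            area = cell_h * cell_w * random.uniform(*scale)
            log_r = random.uniform(math.log(ratio[0]), math.log(ratio[1]))
            r = math.exp(log_r)                      # aspect ratio w/h
            bh, bw = math.sqrt(area / r), math.sqrt(area * r)
            cx, cy = (gx + 0.5) * cell_w, (gy + 0.5) * cell_h
            x0, y0 = max(0.0, cx - bw / 2), max(0.0, cy - bh / 2)
            x1, y1 = min(float(w), cx + bw / 2), min(float(h), cy + bh / 2)
            boxes.append((x0, y0, x1, y1))
    return boxes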
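
The dataset-splits row mentions the object-level k-NN evaluation (O-KNN): labels for val2017 objects are predicted from the k nearest object-level features extracted on train2017. The toy sketch below shows only the nearest-neighbour voting step; how object features are pooled, the value of k, and the use of unweighted majority voting are assumptions made for illustration.

import torch

def object_knn_predict(train_feats, train_labels, val_feats, k=20):
    """train_feats: (N, D) L2-normalised object features; train_labels: (N,)
    integer class ids; val_feats: (M, D). Returns (M,) predicted labels."""
    sims = val_feats @ train_feats.t()      # cosine similarity, shape (M, N)
    topk = sims.topk(k, dim=1).indices      # k nearest training objects
    votes = train_labels[topk]              # (M, k) candidate labels
    preds, _ = torch.mode(votes, dim=1)     # unweighted majority vote
    return preds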