Self-Supervised Visual Representation Learning from Hierarchical Grouping

Authors: Xiao Zhang, Michael Maire

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments demonstrate that our approach can serve as state-of-the-art generic pre-training, benefiting downstream tasks. We additionally explore applications to semantic region search and video-based object instance tracking."
Researcher Affiliation | Academia | Xiao Zhang, University of Chicago, zhang7@uchicago.edu; Michael Maire, University of Chicago, mmaire@uchicago.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | "We experiment on datasets of complex scenes, with variable numbers of object instances: PASCAL [14] and COCO [31]. ... The COCO-2014 [31] dataset provides instance and semantic segmentations for 81 foreground object classes on over 80K training images. ... We also benchmark learned embeddings on the DAVIS-2017 [40] dataset ... We instead turn to structured edges (SE) [11], which only leverages the small supervised BSDS [34] for training. ... ImageNet [9]"
Dataset Splits | Yes | "PASCAL provides 1464 and 1449 pixel-wise annotated images for training and validation, respectively. ... We evaluate learned embeddings on the PASCAL val set by training a pixel-wise classifier for semantic segmentation on PASCAL train_aug, set atop frozen features."
Hardware Specification | No | The paper does not provide specific hardware details for its experiments.
Software Dependencies | No | The paper mentions software only by name (e.g., Adam, DeepLabv3) without version numbers for dependencies.
Experiment Setup | Yes | "We use Adam [23] to train our model for 80 epochs with batch size 70. We initialize the learning rate as 1e-2, which is then decayed by 0.1 at epochs 25, 45, and 60. We perform data augmentation including random resized cropping, random horizontal flipping, and color jittering on input images, which are then resized to 224x224 before being fed into the network. For one image, we randomly sample 7 regions and, for each region, sample 10 positive pixels and 5 negative pixels. We use σp = 0.8 for all experiments. In experiments fine-tuning on PASCAL train_aug ... Here, we use SGD with weight decay 5e-4 and momentum 0.9 to optimize the pixel-wise cross-entropy loss for 20K iterations with batch size 20. We randomly crop and resize images to 384x384 patches. The learning rate starts at 0.03 and decays by 0.1 at 10K and 15K iterations."
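The quoted experiment setup describes two step-decay learning-rate schedules. As a minimal sketch (not the authors' code; function names are illustrative), the pre-training schedule (Adam, base lr 1e-2, decayed by 0.1 at epochs 25/45/60) and fine-tuning schedule (SGD, base lr 0.03, decayed by 0.1 at 10K/15K of 20K iterations) can be written as:

```python
def multistep_lr(base_lr, milestones, gamma, step):
    # Step decay: multiply base_lr by gamma once for each milestone passed.
    return base_lr * gamma ** sum(step >= m for m in milestones)

# Pre-training schedule from the quote: Adam, base lr 1e-2,
# decayed by 0.1 at epochs 25, 45, 60 (80 epochs total, batch size 70).
def pretrain_lr(epoch):
    return multistep_lr(1e-2, [25, 45, 60], 0.1, epoch)

# Fine-tuning schedule from the quote: SGD (momentum 0.9, weight decay 5e-4),
# base lr 0.03, decayed by 0.1 at 10K and 15K of 20K iterations, batch size 20.
def finetune_lr(iteration):
    return multistep_lr(0.03, [10_000, 15_000], 0.1, iteration)
```

In a PyTorch setup this corresponds to `torch.optim.lr_scheduler.MultiStepLR` with the respective milestones and `gamma=0.1`.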