Unsupervised Learning of Dense Visual Representations

Authors: Pedro O. O. Pinheiro, Amjad Almahairi, Ryan Benmalek, Florian Golemo, Aaron C. Courville

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate its performance by seeing how the learned features can be transferred to downstream tasks, either as a feature extractor or used for fine-tuning. We show that (unsupervised) contrastive learning of dense representations is more effective than its global counterparts in many visual understanding tasks (instance and semantic segmentation, object detection, keypoint detection, correspondence and depth prediction). Perhaps more interestingly, VADeR unsupervised pretraining outperforms ImageNet supervised pretraining at different tasks. (An illustrative code sketch of this dense contrastive objective is included after the table.)
Researcher Affiliation | Collaboration | Pedro O. Pinheiro (1), Amjad Almahairi, Ryan Y. Benmalek (2), Florian Golemo (1,3), Aaron Courville (3,4); 1: Element AI, 2: Cornell University, 3: Mila, Université de Montréal, 4: CIFAR Fellow
Pseudocode | No | The paper describes the approach and implementation details in prose but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using and evaluating with external codebases such as 'Detectron2' and 'TimeCycle' and provides links to those, but it does not state that the authors are releasing their own source code for the VADeR method described in the paper.
Open Datasets | Yes | We train our model on the ImageNet-1K [9] train split, containing approximately 1.28M images. We test the frozen features on two datasets for semantic segmentation (PASCAL VOC12 [17] and Cityscapes [7]) and one for depth prediction (NYU-depth v2 [51]). Table 2 shows results on the DAVIS-2017 validation set [59]. All methods are trained and evaluated on COCO [42] with the standard metrics.
Dataset Splits | Yes | In all datasets, we train the linear model on the provided train set and evaluate on the validation set. Table 2 shows results on the DAVIS-2017 validation set [59]. Table 4 reports Mask R-CNN results on object detection, instance segmentation and keypoint detection fine-tuned on COCO; results are shown on val2017, averaged over 5 trials. Figure 4 shows results of fine-tuning on semantic segmentation (PASCAL VOC12) and on depth prediction (NYU-d v2), assuming different amounts of labeled data (2, 5, 10, 20, 50 and 100% of the dataset).
Hardware Specification | No | The paper states 'We train using 4 GPUs' but does not specify the model or type of GPU, or any other hardware components such as CPU or memory.
Software Dependencies | No | The paper mentions using various models and frameworks such as 'ResNet-50 [28]', 'FPN [41]', 'Mask R-CNN [27]', and 'Detectron2 [72]' but does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or operating systems.
Experiment Setup | Yes | We train using 4 GPUs with a batch size of 128 for about 6M iterations. We use a learning rate of 3e-7 and 3e-3 for the encoder and decoder, respectively. We set the temperature to 0.07. We set the size of the dictionary to 65,536 and use a momentum of 0.999. The batch normalization layers are trained with SyncBatchNorm [58] and we add batch norm on all FPN layers. All models are trained in a controlled setting for around 12 epochs (schedule 1x). (A configuration sketch based on these values follows after the table.)
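
For readers who want a concrete picture of the dense contrastive objective referenced in the Research Type row, the following is a minimal, hypothetical PyTorch sketch of a pixel-level InfoNCE loss. It assumes matched pixel features from two augmented views have already been extracted and row-aligned (the paper matches pixels through the known augmentation geometry) and that negatives come from a memory queue; the function name, tensor shapes, and queue handling are illustrative and are not taken from the authors' code.

    import torch
    import torch.nn.functional as F

    def dense_info_nce(q_feats: torch.Tensor,
                       k_feats: torch.Tensor,
                       queue: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
        """Pixel-level InfoNCE loss (illustrative sketch).

        q_feats: (N, C) query pixel features from view 1.
        k_feats: (N, C) key pixel features from view 2, row-aligned with
                 q_feats so row i in both tensors is the same physical point.
        queue:   (K, C) negative features accumulated from previous batches.
        """
        q = F.normalize(q_feats, dim=1)
        k = F.normalize(k_feats, dim=1)
        neg = F.normalize(queue, dim=1)

        pos_logits = (q * k).sum(dim=1, keepdim=True)   # (N, 1) positives
        neg_logits = q @ neg.t()                        # (N, K) negatives
        logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature

        # For every query pixel, the positive sits at column 0.
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels)

Each query pixel is pulled toward the feature of the same physical point in the other view and pushed away from all queued negatives, mirroring the per-pixel positives the paper describes.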
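To make the hyperparameters in the Experiment Setup row easier to reuse, here is a short, hypothetical PyTorch sketch collecting them together with the MoCo-style bookkeeping (momentum-updated key encoder and negative dictionary) implied by the momentum of 0.999 and dictionary size of 65,536. The helper names and queue layout are assumptions, not the authors' released code.

    import torch

    # Values quoted in the Experiment Setup row above.
    BATCH_SIZE   = 128
    LR_ENCODER   = 3e-7      # encoder learning rate
    LR_DECODER   = 3e-3      # decoder learning rate
    TEMPERATURE  = 0.07
    QUEUE_SIZE   = 65_536    # size of the negative dictionary
    EMA_MOMENTUM = 0.999     # key-encoder momentum

    @torch.no_grad()
    def momentum_update(query_net: torch.nn.Module,
                        key_net: torch.nn.Module,
                        m: float = EMA_MOMENTUM) -> None:
        """Exponential-moving-average update of the key network's weights."""
        for p_q, p_k in zip(query_net.parameters(), key_net.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

    @torch.no_grad()
    def dequeue_and_enqueue(queue: torch.Tensor, ptr: int,
                            keys: torch.Tensor) -> int:
        """Overwrite the oldest entries of the (QUEUE_SIZE, C) negative
        queue and return the new write pointer (assumes QUEUE_SIZE is a
        multiple of the number of keys enqueued per step)."""
        n = keys.size(0)
        queue[ptr:ptr + n] = keys.detach()
        return (ptr + n) % queue.size(0)

In practice, the two learning rates would typically be passed as separate optimizer parameter groups for the encoder and decoder.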