MST: Masked Self-Supervised Transformer for Visual Representation
Authors: Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. |
| Researcher Affiliation | Collaboration | National Laboratory of Pattern Recognition, Institute of Automation, CAS; School of Artificial Intelligence, University of Chinese Academy of Sciences; SenseTime Research; University of California, Los Angeles |
| Pseudocode | Yes | Algorithm 1: Pseudo code of attention-guided mask strategy in a PyTorch-like style. (A hedged sketch of this strategy appears below the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Dataset and Models Our method is validated on the popular ImageNet-1k dataset [9]. ... We perform object detection experiments with the MS COCO [18] dataset and the Mask R-CNN detector [15] framework. ... SETR [37] provides a semantic segmentation framework for the standard Vision Transformer. Hence, we adopt SETR as the semantic segmentation strategy on Cityscapes [8]. |
| Dataset Splits | Yes | This dataset contains 1.28M images in the training set and 50K images in the validation set from 1000 classes. We only use the training set during the process of self-supervised learning. ... MS COCO is a popular benchmark for object detection, with 118K images in the training set and 5K images in the validation set. ... Cityscapes contains 5000 images, with 19 object categories annotated at the pixel level. There are 2975, 500, and 1525 images in the training, validation, and test sets, respectively. |
| Hardware Specification | Yes | Throughput (im/s) is calculated on a single NVIDIA V100 GPU with batch size 128. (A timing sketch appears below the table.) |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Our model is optimized by AdamW [22] with learning rate 2×10⁻³ and batch size 1024. Weight decay is set to 0.04. We adopt learning rate warmup [12] in the first 10 epochs, and after warmup the learning rate follows a cosine decay schedule [21]. The model uses multi-crop similar to [1] and data augmentations similar to [13]. The settings of momentum, temperature coefficient, and weight decay follow [2]. The coefficient λ1 of the basic instance discrimination task is set to 1.0, while the coefficient λ2 of the restoration task is set to 0.6. (An optimizer-and-schedule sketch appears below the table.) |
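
The Pseudocode row points to the paper's Algorithm 1, an attention-guided mask strategy given in a PyTorch-like style. The listing below is a minimal sketch of that idea, not the authors' code: it masks a fraction of patch tokens drawn only from the patches that receive the least class-token attention, so semantically important regions stay visible for the restoration task. The function name, the `mask_ratio` default, and the bottom-half candidate pool are illustrative assumptions.

```python
import torch

def attention_guided_mask(attn, mask_ratio=0.15):
    """Mask patches with low class-token attention.

    attn: (B, N) tensor, e.g. the teacher's last-layer class-token
          attention averaged over heads; N is the number of patches.
    Returns a (B, N) boolean mask, True where a patch is masked.
    """
    B, N = attn.shape
    num_mask = int(N * mask_ratio)
    # Sort patches by attention, ascending: low-attention patches first.
    order = attn.argsort(dim=1)
    # Candidate pool: the least-attended half of the patches (assumption).
    candidates = order[:, : N // 2]
    # Randomly choose num_mask patches per image from the candidate pool.
    pick = torch.rand(B, candidates.size(1), device=attn.device)
    chosen = candidates.gather(1, pick.argsort(dim=1)[:, :num_mask])
    mask = torch.zeros(B, N, dtype=torch.bool, device=attn.device)
    mask.scatter_(1, chosen, True)
    return mask
```

This sketch only covers mask selection; in MST the masked positions are then replaced (e.g. with a learnable mask token) before the student forward pass, and a decoder restores the original image from the token sequence.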
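The Hardware Specification row describes how throughput is measured. A simple way to reproduce such a number is the benchmark sketch below; the input resolution, warmup, and iteration counts are assumptions, since the paper states only the GPU model and batch size.

```python
import time
import torch

@torch.no_grad()
def throughput_images_per_sec(model, batch_size=128, image_size=224,
                              warmup=10, iters=30, device="cuda"):
    """Measure inference throughput (im/s) with synthetic inputs."""
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(warmup):        # warm up CUDA kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()       # wait for all queued kernels to finish
    return batch_size * iters / (time.time() - start)
```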
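The Experiment Setup row lists the optimization hyperparameters: AdamW, learning rate 2×10⁻³, batch size 1024, weight decay 0.04, linear warmup for the first 10 epochs, then cosine decay. The sketch below wires those values together; the total epoch count and the per-epoch step granularity are assumptions not stated in the quoted text.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_epochs=100, warmup_epochs=10,
                                 base_lr=2e-3, weight_decay=0.04):
    """AdamW with linear warmup followed by cosine decay (stepped per epoch)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs            # linear warmup
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Per the quoted setup, the training objective would then combine the two tasks as L = λ1·L_instance + λ2·L_restoration with λ1 = 1.0 and λ2 = 0.6.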