Contrastive Transformation for Self-supervised Correspondence Learning

Authors: Ning Wang, Wengang Zhou, Houqiang Li

AAAI 2021, pp. 10174-10182

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our simple framework outperforms the recent self-supervised correspondence methods on a range of visual tasks including video object tracking (VOT), video object segmentation (VOS), pose keypoint tracking, etc. It is worth mentioning that our method also surpasses the fully-supervised affinity representation (e.g., ResNet) and performs competitively against the recent fully-supervised algorithms designed for the specific tasks (e.g., VOT and VOS).
Researcher Affiliation | Academia | Ning Wang (1), Wengang Zhou (1), Houqiang Li (1,2); (1) CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Pseudocode | Yes | Algorithm 1: Offline Training Process
Open Source Code | Yes | The source code and pretrained model will be available at https://github.com/594422814/ContrastCorr
Open Datasets | Yes | The training dataset is TrackingNet (Müller et al. 2018) with about 30k videos. Note that previous works (Wang, Jabri, and Efros 2019; Li et al. 2019) use the Kinetics dataset (Zisserman et al. 2017), which is much larger in scale than TrackingNet. Our framework randomly crops and tracks patches of 256 × 256 pixels (i.e., patch-level tracking), and further yields a 32 × 32 intra-video affinity (i.e., the network stride is 8).
Dataset Splits | Yes | In Table 1, we show ablative experiments of our method on the DAVIS-2017 validation dataset (Pont-Tuset et al. 2017).
Hardware Specification | Yes | The training stage takes about one day on 4 Nvidia 1080Ti GPUs.
Software Dependencies | No | The paper mentions using a ResNet-18 backbone network but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | The training dataset is TrackingNet (Müller et al. 2018) with about 30k videos. Our framework randomly crops and tracks patches of 256 × 256 pixels (i.e., patch-level tracking), and further yields a 32 × 32 intra-video affinity (i.e., the network stride is 8). The batch size is 16. Therefore, each positive embedding contrasts with 15 × (32 × 32 × 2) = 30,720 negative embeddings. Since our method considers pixel-level features, a small batch size also involves abundant contrastive samples. We first train the intra-video transformation (warm-up stage) for the first 100 epochs and then train the whole framework in an end-to-end manner for another 100 epochs. The learning rate of both stages is 1 × 10^-4 and is reduced by half every 40 epochs.
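
The patch size, network stride, affinity size, and negative-sample count quoted in the Open Datasets and Experiment Setup rows are mutually consistent; the short Python sketch below simply reproduces that arithmetic (the variable names are ours, not the paper's):

# Arithmetic reproduced from the table above; variable names are illustrative only.
patch_size = 256                       # cropped patch is 256 x 256 pixels
stride = 8                             # network stride reported in the paper
feat_size = patch_size // stride       # 256 / 8 = 32, hence the 32 x 32 affinity grid
embeddings_per_frame = feat_size * feat_size   # 1024 pixel-level embeddings per frame
batch_size = 16
frames_per_sample = 2                  # each training sample contributes two frames
# Each positive embedding contrasts with the embeddings of the other 15 samples:
negatives = (batch_size - 1) * embeddings_per_frame * frames_per_sample
print(feat_size, negatives)            # 32 30720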
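
The two-stage schedule in the Experiment Setup row (100 warm-up epochs on the intra-video transformation, 100 end-to-end epochs, learning rate 1 × 10^-4 halved every 40 epochs) could be wired up roughly as follows. This is a minimal sketch, not the authors' code: the Adam optimizer, the stand-in model, and the per-epoch stub are assumptions; only the epoch counts and the learning-rate schedule come from the quoted text.

import torch

# Stand-in model and per-epoch step so the schedule sketch runs end to end;
# the real framework uses a ResNet-18 backbone and patch-level contrastive losses.
model = torch.nn.Conv2d(3, 64, kernel_size=3)   # placeholder, not the actual backbone

def train_one_epoch(model, optimizer):
    # Stub for one epoch of patch-level tracking / contrastive training.
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 256, 256)).mean()   # dummy loss on a 256 x 256 patch
    loss.backward()
    optimizer.step()

for stage in ("intra-video warm-up", "end-to-end"):
    # "The learning rate of both stages is 1e-4 and is reduced by half every 40 epochs."
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # optimizer choice assumed
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.5)
    for epoch in range(100):                                    # 100 epochs per stage
        train_one_epoch(model, optimizer)
        scheduler.step()

Recreating the optimizer and scheduler at the start of the second stage reflects the statement that both stages start from 1 × 10^-4; the excerpt does not say whether optimizer state is carried over between stages.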