Contrastive Transformation for Self-supervised Correspondence Learning
Authors: Ning Wang, Wengang Zhou, Houqiang Li (pp. 10174-10182)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our simple framework outperforms the recent self-supervised correspondence methods on a range of visual tasks including video object tracking (VOT), video object segmentation (VOS), pose keypoint tracking, etc. It is worth mentioning that our method also surpasses the fully-supervised affinity representation (e.g., ResNet) and performs competitively against the recent fully-supervised algorithms designed for the specific tasks (e.g., VOT and VOS). |
| Researcher Affiliation | Academia | Ning Wang1, Wengang Zhou1, Houqiang Li1,2 1CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | Yes | Algorithm 1: Offline Training Process |
| Open Source Code | Yes | The source code and pretrained model will be available at https://github.com/594422814/ContrastCorr |
| Open Datasets | Yes | The training dataset is TrackingNet (Müller et al. 2018) with about 30k videos. Note that previous works (Wang, Jabri, and Efros 2019; Li et al. 2019) use the Kinetics dataset (Zisserman et al. 2017), which is much larger in scale than TrackingNet. Our framework randomly crops and tracks the patches of 256×256 pixels (i.e., patch-level tracking), and further yields a 32×32 intra-video affinity (i.e., the network stride is 8). |
| Dataset Splits | Yes | In Table 1, we show ablative experiments of our method on the DAVIS-2017 validation dataset (Pont-Tuset et al. 2017). |
| Hardware Specification | Yes | The training stage takes about one day on 4 Nvidia 1080Ti GPUs. |
| Software Dependencies | No | The paper mentions using a ResNet-18 backbone network but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | The training dataset is TrackingNet (Müller et al. 2018) with about 30k videos. Our framework randomly crops and tracks the patches of 256×256 pixels (i.e., patch-level tracking), and further yields a 32×32 intra-video affinity (i.e., the network stride is 8). The batch size is 16. Therefore, each positive embedding contrasts with 15 × (32 × 32 × 2) = 30720 negative embeddings. Since our method considers pixel-level features, a small batch size also involves abundant contrastive samples. We first train the intra-video transformation (warm-up stage) for the first 100 epochs and then train the whole framework in an end-to-end manner for another 100 epochs. The learning rate of both stages is 1 × 10⁻⁴ and will be reduced by half every 40 epochs. |
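
The reported setup above fixes the contrastive sample count and learning-rate schedule. The following minimal Python sketch (not the authors' code; variable names such as `batch_size`, `affinity_size`, and `learning_rate` are illustrative assumptions) shows how the 30720 negatives per positive embedding and the halving schedule work out numerically.

```python
# Sketch of the reported contrastive-sample arithmetic and LR schedule.
# All names here are illustrative assumptions, not the authors' implementation.

batch_size = 16          # videos per batch, as reported
affinity_size = 32       # 32x32 affinity from stride-8 features of a 256x256 patch
frames_per_pair = 2      # each video pair contributes two frames of pixel-level embeddings

# Each positive embedding contrasts with the embeddings of the other 15 videos:
# 15 * (32 * 32 * 2) = 30720 negatives.
num_negatives = (batch_size - 1) * (affinity_size ** 2 * frames_per_pair)
print(num_negatives)  # 30720

def learning_rate(epoch, base_lr=1e-4, decay_every=40):
    """Base LR of 1e-4, halved every 40 epochs, per the reported schedule."""
    return base_lr * (0.5 ** (epoch // decay_every))

# Warm-up (intra-video transformation only) for 100 epochs, then 100 more
# epochs of end-to-end training; both stages use the same schedule.
for epoch in (0, 40, 80, 99):
    print(epoch, learning_rate(epoch))
```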