Contrastive Transformation for Self-supervised Correspondence Learning

Authors: Ning Wang, Wengang Zhou, Houqiang Li

AAAI 2021, pp. 10174-10182

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our simple framework outperforms the recent self-supervised correspondence methods on a range of visual tasks including video object tracking (VOT), video object segmentation (VOS), pose keypoint tracking, etc. It is worth mentioning that our method also surpasses the fully-supervised affinity representation (e.g., ResNet) and performs competitively against the recent fully-supervised algorithms designed for the specific tasks (e.g., VOT and VOS).
Researcher Affiliation | Academia | Ning Wang (1), Wengang Zhou (1), Houqiang Li (1,2); (1) CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Pseudocode | Yes | Algorithm 1: Offline Training Process
Open Source Code | Yes | The source code and pretrained model will be available at https://github.com/594422814/ContrastCorr
Open Datasets | Yes | The training dataset is TrackingNet (Müller et al. 2018) with about 30k videos. Note that previous works (Wang, Jabri, and Efros 2019; Li et al. 2019) use the Kinetics dataset (Zisserman et al. 2017), which is much larger in scale than TrackingNet. Our framework randomly crops and tracks patches of 256 × 256 pixels (i.e., patch-level tracking), and further yields a 32 × 32 intra-video affinity (i.e., the network stride is 8).
Dataset Splits | Yes | In Table 1, we show ablative experiments of our method on the DAVIS-2017 validation dataset (Pont-Tuset et al. 2017).
Hardware Specification | Yes | The training stage takes about one day on 4 Nvidia 1080Ti GPUs.
Software Dependencies | No | The paper mentions using a ResNet-18 backbone network but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | The training dataset is TrackingNet (Müller et al. 2018) with about 30k videos. Our framework randomly crops and tracks patches of 256 × 256 pixels (i.e., patch-level tracking), and further yields a 32 × 32 intra-video affinity (i.e., the network stride is 8). The batch size is 16. Therefore, each positive embedding contrasts with 15 × (32 × 32 × 2) = 30,720 negative embeddings. Since our method considers pixel-level features, a small batch size also involves abundant contrastive samples. We first train the intra-video transformation (warm-up stage) for the first 100 epochs and then train the whole framework in an end-to-end manner for another 100 epochs. The learning rate of both stages is 1 × 10^-4 and is reduced by half every 40 epochs.
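
The patch size, network stride, affinity size, and negative-sample count quoted in the Open Datasets and Experiment Setup rows are mutually consistent; the short Python sketch below simply reproduces that arithmetic (the variable names are ours, not the paper's):

# Arithmetic reproduced from the table above; variable names are illustrative only.
patch_size = 256                       # cropped patch is 256 x 256 pixels
stride = 8                             # network stride reported in the paper
feat_size = patch_size // stride       # 256 / 8 = 32, hence the 32 x 32 affinity grid
embeddings_per_frame = feat_size * feat_size   # 1024 pixel-level embeddings per frame
batch_size = 16
frames_per_sample = 2                  # each training sample contributes two frames
# Each positive embedding contrasts with the embeddings of the other 15 samples:
negatives = (batch_size - 1) * embeddings_per_frame * frames_per_sample
print(feat_size, negatives)            # 32 30720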
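
The two-stage schedule in the Experiment Setup row (100 warm-up epochs on the intra-video transformation, 100 end-to-end epochs, learning rate 1 × 10^-4 halved every 40 epochs) could be wired up roughly as follows. This is a minimal sketch, not the authors' code: the Adam optimizer, the stand-in model, and the per-epoch stub are assumptions; only the epoch counts and the learning-rate schedule come from the quoted text.

import torch

# Stand-in model and per-epoch step so the schedule sketch runs end to end;
# the real framework uses a ResNet-18 backbone and patch-level contrastive losses.
model = torch.nn.Conv2d(3, 64, kernel_size=3)   # placeholder, not the actual backbone

def train_one_epoch(model, optimizer):
    # Stub for one epoch of patch-level tracking / contrastive training.
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 256, 256)).mean()   # dummy loss on a 256 x 256 patch
    loss.backward()
    optimizer.step()

for stage in ("intra-video warm-up", "end-to-end"):
    # "The learning rate of both stages is 1e-4 and is reduced by half every 40 epochs."
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # optimizer choice assumed
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.5)
    for epoch in range(100):                                    # 100 epochs per stage
        train_one_epoch(model, optimizer)
        scheduler.step()

Recreating the optimizer and scheduler at the start of the second stage reflects the statement that both stages start from 1 × 10^-4; the excerpt does not say whether optimizer state is carried over between stages.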