Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Authors: Shentong Mo, Peter Tong

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.
Researcher Affiliation | Academia | Shentong Mo¹, Shengbang Tong². Corresponding author: shentongmo@gmail.com. 38th Conference on Neural Information Processing Systems (NeurIPS 2024). No explicit institutional names are provided for the authors; however, the paper is presented at NeurIPS, which is primarily an academic conference.
Pseudocode | Yes | C.2 Pseudo Code. The following pseudo-code outlines the batch-wise application of VICReg within the C-JEPA framework, providing clarity on the computational steps involved during training:

    # Pseudo-code for batch-wise VICReg application in C-JEPA
    for each batch in dataset:
        # Forward pass through the context and target encoders
        context_embeddings, target_embeddings = encode(batch)
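The quoted pseudo-code stops after the encoding step. As a minimal sketch of how the VICReg variance and covariance terms could be applied batch-wise on top of it, the following PyTorch fragment may be useful; the function name, the gamma/eps values, the encode helper, and the way the terms are combined are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def vicreg_var_cov(z, gamma=1.0, eps=1e-4):
        # z: batch of embeddings with shape (N, D)
        z = z - z.mean(dim=0)                        # center each dimension
        std = torch.sqrt(z.var(dim=0) + eps)         # per-dimension standard deviation
        var_loss = torch.mean(F.relu(gamma - std))   # hinge keeps each std above gamma
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)                    # (D, D) covariance matrix
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_loss = off_diag.pow(2).sum() / d         # penalize off-diagonal correlations
        return var_loss, cov_loss

    # Mirroring the paper's batch loop (hypothetical usage):
    #     context_embeddings, target_embeddings = encode(batch)
    #     var_c, cov_c = vicreg_var_cov(context_embeddings)
    #     var_t, cov_t = vicreg_var_cov(target_embeddings)
    #     regularizer = var_c + var_t + cov_c + cov_t   # added to the JEPA prediction loss

In VICReg terms, the variance hinge discourages dimensional collapse and the covariance penalty decorrelates embedding dimensions; how C-JEPA weights these terms against the predictive loss is not specified in the quoted excerpt.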
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: See the supplemental material.
Open Datasets | Yes | Following previous methods [12, 2], we use ImageNet-1K [36] for image classification, MS-COCO [37] for object detection and instance segmentation, and ADE20K [38, 39] for semantic segmentation. For video object segmentation, we use the DAVIS-2017 dataset containing 60 training, 30 validation, and 60 testing videos. For low-level tasks, we follow the previous work [2] and use Clevr/Count and Clevr/Dist on the Clevr [45] dataset.
Dataset Splits | Yes | For video object segmentation, we use the DAVIS-2017 dataset containing 60 training, 30 validation, and 60 testing videos.
Hardware Specification | No | The paper specifies the Vision Transformer (ViT) architectures used (tiny, small, base, and large models) but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud compute instances) used for running the experiments.
Software Dependencies | No | The paper mentions using 'AdamW' as an optimizer and cites the paper where it was introduced, but it does not provide specific version numbers for AdamW itself or for any other software dependencies or libraries used for implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For input images, we resized the resolution to 224 × 224, i.e., H = W = 224. Following prior work [12, 2], we apply a patch size of 16, i.e., P = 16. We use the tiny, small, base, and large models of the ViT [43] architecture for experiments. We set the embedding dimension of the predictor to 384, and keep the number of self-attention heads the same as the backbone context-encoder. For the ViT-T/16, ViT-S/16, and ViT-B/16 context-encoders, we set the depth of the predictor to 6. For ViT-L/16 context-encoders, we set the depth of the predictor to 12. Following I-JEPA [2], we use AdamW to optimize the context-encoder and predictor weights. We train our model using the default batch size of 2048; the learning rate is linearly increased from 1e-4 to 1e-3 during the first 15 epochs of pre-training and decayed to 1e-6 following a cosine schedule. The weight decay is linearly increased from 0.04 to 0.4, and the target-encoder weights are initialized the same as the context-encoder weights and updated via an exponential moving average. We use a momentum value of 0.996, and linearly increase this value to 1.0. For masking, we use the same strategy and settings as I-JEPA [2] for 4 possibly overlapping target block masks.
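To make the quoted schedules concrete, here is a minimal Python sketch of the learning-rate warmup plus cosine decay, the linear weight-decay ramp, and the EMA momentum ramp described above. The total epoch count and the per-epoch granularity are assumptions for illustration; the paper's actual training code may differ.

    import math

    TOTAL_EPOCHS = 300          # assumed for illustration; not stated in the quoted setup
    WARMUP_EPOCHS = 15          # from the quoted setup
    LR_START, LR_PEAK, LR_FINAL = 1e-4, 1e-3, 1e-6
    WD_START, WD_END = 0.04, 0.4
    EMA_START, EMA_END = 0.996, 1.0

    def lr_at(epoch):
        # Linear warmup from LR_START to LR_PEAK, then cosine decay to LR_FINAL.
        if epoch < WARMUP_EPOCHS:
            return LR_START + (LR_PEAK - LR_START) * epoch / WARMUP_EPOCHS
        t = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
        return LR_FINAL + 0.5 * (LR_PEAK - LR_FINAL) * (1 + math.cos(math.pi * t))

    def wd_at(epoch):
        # Weight decay linearly increased from 0.04 to 0.4 over training.
        return WD_START + (WD_END - WD_START) * epoch / TOTAL_EPOCHS

    def ema_momentum_at(epoch):
        # Target-encoder EMA momentum linearly increased from 0.996 to 1.0.
        return EMA_START + (EMA_END - EMA_START) * epoch / TOTAL_EPOCHS

    # Target-encoder update per step (standard EMA rule):
    #     theta_target = m * theta_target + (1 - m) * theta_context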