Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
Authors: Shentong Mo, Peter Tong
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics. |
| Researcher Affiliation | Academia | Shentong Mo, Shengbang Tong. Corresponding author: shentongmo@gmail.com. 38th Conference on Neural Information Processing Systems (NeurIPS 2024). No explicit institutional names are provided for the authors; however, the paper is presented at NeurIPS, which is primarily an academic conference. |
| Pseudocode | Yes | C.2 Pseudo Code. The following pseudo-code outlines the batch-wise application of VICReg within the C-JEPA framework, providing clarity on the computational steps involved during training: # Pseudo-code for batch-wise VICReg application in C-JEPA; for each batch in dataset: # Forward pass through the context and target encoders; context_embeddings, target_embeddings = encode(batch). (The paper's multi-line pseudo-code is flattened by this table cell; a hedged runnable sketch of the VICReg step appears after the table.) |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: See the supplemental material. |
| Open Datasets | Yes | Following previous methods [12, 2], we use ImageNet-1K [36] for image classification, MS-COCO [37] for object detection and instance segmentation, and ADE20K [38, 39] for semantic segmentation. For video object segmentation, we use the DAVIS-2017 dataset containing 60 training, 30 validation, and 60 testing videos. For low-level tasks, we follow the previous work [2] and use Clevr/Count and Clevr/Dist on the Clevr [45] dataset. |
| Dataset Splits | Yes | For video object segmentation, we use DAVIS-2017 dataset containing 60 training, 30 validation, and 60 testing videos. |
| Hardware Specification | No | The paper specifies the Vision Transformer (ViT) architectures used (tiny, small, base, and large models) but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud compute instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'AdamW' as an optimizer and cites the paper where it was introduced, but it does not provide specific version numbers for any software dependencies or libraries used for implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For input images, we resized the resolution to 224 × 224, i.e., H = W = 224. Following prior work [12, 2], we apply a patch size of 16, i.e., P = 16. We use the tiny, small, base, and large models of the ViT [43] architecture for experiments. We set the embedding dimension of the predictor to 384, and keep the number of self-attention heads the same as the backbone context-encoder. For the ViT-T/16, ViT-S/16, and ViT-B/16 context-encoders, we set the depth of the predictor to 6. For the ViT-L/16 context-encoder, we set the depth of the predictor to 12. Following I-JEPA [2], we use AdamW to optimize the context-encoder and predictor weights. We train our model using the default batch size of 2048; the learning rate is linearly increased from 1e-4 to 1e-3 during the first 15 epochs of pre-training and decayed to 1e-6 following a cosine schedule. The weight decay is linearly increased from 0.04 to 0.4, and the target-encoder weights are initialized the same as the context-encoder weights and updated via an exponential moving average. We use a momentum value of 0.996 and linearly increase this value to 1.0. For masking, we use the same strategy and settings as I-JEPA [2] for 4 possibly overlapping target block masks. (These schedules are sketched in code after the table.) |
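
The pseudo-code quoted in the Pseudocode row stops at the encoder forward pass. As a companion, here is a minimal, hedged sketch of a batch-wise VICReg regularizer of the kind C-JEPA combines with the I-JEPA objective. The function name `vicreg_loss`, the assumption of pooled `(batch, dim)` embeddings, and the loss coefficients (taken from the original VICReg paper's defaults) are illustrative choices, not the authors' exact implementation.

```python
# Hedged sketch of a batch-wise VICReg term as combined with a JEPA-style
# objective; names and coefficients are illustrative, not the authors' code.
import torch
import torch.nn.functional as F


def vicreg_loss(z_a, z_b, sim_coeff=25.0, std_coeff=25.0, cov_coeff=1.0, eps=1e-4):
    """Invariance + variance + covariance terms on two (batch, dim) embeddings,
    e.g. pooled context-predictor outputs and target-encoder outputs."""
    # Invariance: mean-squared error between the two sets of embeddings.
    inv_loss = F.mse_loss(z_a, z_b)

    # Variance: hinge loss keeping the per-dimension standard deviation above 1.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std_a).mean() + torch.relu(1.0 - std_b).mean()

    # Covariance: penalize off-diagonal entries of each embedding covariance matrix.
    def off_diagonal_cov_penalty(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    cov_loss = off_diagonal_cov_penalty(z_a) + off_diagonal_cov_penalty(z_b)
    return sim_coeff * inv_loss + std_coeff * var_loss + cov_coeff * cov_loss
```

How the per-patch predictor and target-encoder outputs are pooled into `(batch, dim)` embeddings before applying this term is an implementation choice the excerpt above does not fix.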
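
For the optimization recipe in the Experiment Setup row, the following sketch illustrates the described warmup-plus-cosine learning rate, the linear weight-decay ramp, and the linear EMA-momentum ramp for the target encoder. The helper names and the `total_epochs` default are assumptions; the excerpt does not state the total pre-training epoch budget.

```python
# Hedged sketch of the schedules described in the Experiment Setup row:
# warmup + cosine learning rate, linear weight-decay ramp, and linear EMA
# momentum ramp. Helper names and the total_epochs default are assumptions.
import math

import torch


def lr_at(step, steps_per_epoch, total_epochs=300, warmup_epochs=15,
          start_lr=1e-4, peak_lr=1e-3, final_lr=1e-6):
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        # Linear warmup from 1e-4 to 1e-3 over the first 15 epochs.
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    # Cosine decay from the peak down to 1e-6 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def weight_decay_at(step, total_steps, start_wd=0.04, end_wd=0.4):
    # Linear increase of weight decay from 0.04 to 0.4 over training.
    return start_wd + (end_wd - start_wd) * step / total_steps


def ema_momentum_at(step, total_steps, start_m=0.996, end_m=1.0):
    # Linear increase of the target-encoder EMA momentum from 0.996 to 1.0.
    return start_m + (end_m - start_m) * step / total_steps


@torch.no_grad()
def ema_update(target_params, context_params, m):
    # target <- m * target + (1 - m) * context, applied parameter-wise.
    for t, c in zip(target_params, context_params):
        t.mul_(m).add_(c, alpha=1.0 - m)
```

In a training loop these scalars would typically be recomputed every iteration and written into the AdamW parameter groups (`lr`, `weight_decay`) before the optimizer step, with `ema_update` applied to the target encoder afterwards.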