CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

Authors: Tim Lebailly, Thomas Stegmüller, Behzad Bozorgtabar, Jean-Philippe Thiran, Tinne Tuytelaars

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first verify that the proposed pretraining method aligns well with the objective of in-context learning via nearest neighbor retrieval. We then show that in doing so, we do not compromise the performance on standard evaluations. General details on the experimental setup can be found in Appendix A.
Researcher Affiliation | Academia | Tim Lebailly (KU Leuven), Thomas Stegmüller (EPFL), Behzad Bozorgtabar (EPFL, CHUV), Jean-Philippe Thiran (EPFL, CHUV), Tinne Tuytelaars (KU Leuven)
Pseudocode | No | The paper describes algorithms in text, particularly in Appendix B, but does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo.
Open Datasets | Yes | Our pretraining datasets include COCO (Lin et al., 2014) and ImageNet-1k (Deng et al., 2009).
Dataset Splits | Yes | The k-NN classifier is fitted on the local representations of a uniformly sub-sampled set of training images and evaluated on all the patches from the validation set of images. mIoU scores are reported on Pascal VOC 2012 (Everingham et al.) and ADE20K (Zhou et al., 2017). The validation set comprises 1,449 images for Pascal VOC 2012; for ADE20K, the training set comprises 20,210 images and the validation set 2,000 images. (A sketch of this patch-level k-NN protocol appears after the table.)
Hardware Specification | Yes | Experiments are run on a single node with 4x AMD MI250X GPUs (2 compute dies per GPU, i.e., world size = 8) with a memory usage of 43.5 GB per compute die.
Software Dependencies | No | The paper mentions using `MMSegmentation (Contributors, 2020)` and the `Adam optimizer (Kingma & Ba, 2014)`, but does not provide specific version numbers for software libraries such as PyTorch, CUDA, or MMSegmentation itself.
Experiment Setup | Yes | The ViT-small (ViT-S/16) is trained for 800 epochs, while the ViT-base (ViT-B/16) is trained for 400 epochs. Pretrainings on COCO use a batch size of 256, while pretrainings on ImageNet-1k use a batch size of 1024. Learning rate, weight decay, and other optimization-related hyperparameters are exactly the same as in DINO (Caron et al., 2021). Results reported in tables using ViT-S/16 (apart from the grid search) are based on the following hyperparameters: (λpos, S, K) = (1.0, 25k, 32) and (λpos, S, K) = (2.0, 25k, 64) for pretrainings on ImageNet-1k and COCO, respectively. (A hedged configuration sketch of these settings follows the table.)
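The Experiment Setup row condenses into a small configuration summary. The sketch below is illustrative only: the key names (arch, lambda_pos, S, K, base_lr, and so on) are ours rather than the authors', the semantics of S and K are only echoed from the paper's (λpos, S, K) notation, and the optimization values in the second block are the commonly published DINO defaults, not settings confirmed by this paper or its repository.

```python
# Hedged sketch of the reported pretraining settings; see the official code at
# https://github.com/tileb1/CrIBo for the authoritative configuration.

CONFIGS = {
    "imagenet1k_vits16": {
        "arch": "vit_small",      # ViT-S/16, 800 epochs
        "epochs": 800,
        "batch_size": 1024,
        "lambda_pos": 1.0,        # λpos in the paper's (λpos, S, K) notation
        "S": 25_000,
        "K": 32,
    },
    "coco_vits16": {
        "arch": "vit_small",      # ViT-S/16, 800 epochs
        "epochs": 800,
        "batch_size": 256,
        "lambda_pos": 2.0,
        "S": 25_000,
        "K": 64,
    },
    "vitb16": {
        "arch": "vit_base",       # ViT-B/16 is trained for fewer epochs
        "epochs": 400,
    },
}

# Optimization hyperparameters follow DINO (Caron et al., 2021). The values
# below are the commonly cited DINO defaults and are an assumption here.
DINO_OPTIM = {
    "optimizer": "adamw",
    "base_lr": 5e-4,              # scaled linearly with batch_size / 256 in DINO
    "weight_decay": 0.04,         # increased with a cosine schedule in DINO
    "warmup_epochs": 10,
}
```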
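Similarly, the Dataset Splits row refers to a dense (patch-level) k-NN evaluation. The sketch below shows one way such a protocol can be set up, assuming patch embeddings and per-patch labels have already been extracted from the frozen backbone; the value of k, the cosine metric, and the mIoU aggregation are assumptions, not details confirmed by the paper.

```python
# Hedged sketch of a patch-level k-NN segmentation evaluation: fit a k-NN
# classifier on patch embeddings from a sub-sampled training set, then
# classify every patch of the validation images and report mIoU.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def patch_knn_miou(train_feats, train_labels, val_feats, val_labels,
                   num_classes, k=20):
    """Fit k-NN on training patch embeddings and compute mIoU on validation patches.

    train_feats : (N_train_patches, D) patch embeddings from sub-sampled training images
    train_labels: (N_train_patches,)   per-patch class ids
    val_feats   : (N_val_patches, D)   patch embeddings from all validation images
    val_labels  : (N_val_patches,)     per-patch class ids
    """
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    preds = knn.predict(val_feats)

    # Per-class intersection-over-union, averaged over classes present in the split.
    ious = []
    for c in range(num_classes):
        inter = np.sum((preds == c) & (val_labels == c))
        union = np.sum((preds == c) | (val_labels == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```

In practice, per-pixel ground-truth masks would be downsampled to the ViT patch grid (16x16-pixel patches for ViT-S/16 and ViT-B/16) before being used as per-patch labels; that preprocessing step is assumed to happen upstream of this function.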