Amortised Invariance Learning for Contrastive Self-Supervision
Authors: Ruchika Chavhan, Jan Stuehmer, Calum Heggan, Mehrdad Yaghoobi, Timothy Hospedales
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the notion of amortised invariances for contrastive learning over two different modalities: vision and audio, on two widely-used contrastive learning methods in vision: SimCLR and MoCo-v2 with popular architectures like ResNets and Vision Transformers, and SimCLR with ResNet-18 for audio. We show that our amortised features provide a reliable way to learn diverse downstream tasks with different invariance requirements, while using a single feature and avoiding task-specific pre-training. |
| Researcher Affiliation | Collaboration | Ruchika Chavhan1, Jan Stuehmer2,3, Calum Heggan1, Mehrdad Yaghoobi1, Timothy Hospedales1,4 — 1University of Edinburgh, 2Karlsruhe Institute of Technology, 3Heidelberg Institute for Theoretical Studies, 4Samsung AI Research Centre, Cambridge |
| Pseudocode | Yes | Listing 1: Code for ResNet50 Bottleneck modified to incorporate hypernetworks. Basic code block has been adapted from the PyTorch Image Models library Wightman (2019). (An illustrative hypernetwork-bottleneck sketch is given after this table.) |
| Open Source Code | No | The paper provides code snippets in the appendix, but does not explicitly state that the full source code for their methodology is open-sourced or provide a link to a repository. |
| Open Datasets | Yes | Pre-training Datasets: We perform self-supervised pre-training for ViT-B and ResNet50 on the 1.28M ImageNet training set Deng et al. (2009) and ImageNet-100 (a 100-category subset of ImageNet) following Chen et al. (2021) and Xiao et al. (2021) respectively. Downstream tasks: Our suite of downstream tasks consists of object recognition on standard benchmarks CIFAR10/100 (Krizhevsky et al., 2009), Caltech101 (Fei-Fei et al., 2004), Flowers (Nilsback & Zisserman, 2008), Pets (Parkhi et al., 2012), DTD (Cimpoi et al., 2014), CUB200 (Wah et al., 2011), as well as a set of spatially sensitive tasks including facial landmark detection on 300W SAG (2016) and CelebA Liu et al. (2015), and pose estimation on Leeds Sports Pose Johnson & Everingham (2010). |
| Dataset Splits | Yes | For all datasets, train-test split protocol is adopted from Lee et al. (2021); Ericsson et al. (2022a). For linear evaluation benchmarks, we randomly choose validation samples in the training split for each dataset when the validation split is not officially provided. For Caltech101, we randomly select 30 images per class to form the training set and we test on the rest of the dataset. |
| Hardware Specification | Yes | Table 16 shows the time taken in GPU days on Tesla V100-SXM2-32GB for 300 pre-training epochs for ResNet50 and ViTs with ImageNet-100 and ImageNet-1k respectively. |
| Software Dependencies | No | The paper mentions the "PyTorch Image Models library Wightman (2019)" as the source for basic code blocks in the appendix, but it does not specify any version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | Both the models are pre-trained for 300 epochs with a batch size of 1024. We follow the optimization protocol in Chen et al. (2021) and use the AdamW optimiser along with learning rate warm-up for 40 epochs, followed by a cosine decay schedule. For downstream evaluation, we use the Adam optimiser with a batch size of 256, and sweep learning rate and weight decay parameters for each downstream dataset based on its validation set. We apply weight decay only on the parameters of the linear classifier. (A hedged sketch of the warm-up/cosine schedule appears after this table.) |
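
The listing referenced above modifies the ResNet50 Bottleneck so that a hypernetwork, conditioned on an invariance descriptor, generates some of the block's convolution weights. The following is a minimal, hypothetical sketch of that idea, not the authors' released code: the class names `HyperBottleneck` and `Conv3x3HyperNet`, the MLP sizes, and the 8-dimensional invariance embedding are illustrative assumptions.

```python
# Hypothetical sketch: a ResNet-style bottleneck whose 3x3 convolution weights are
# produced by a small hypernetwork conditioned on an invariance embedding z.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Conv3x3HyperNet(nn.Module):
    """Maps an invariance embedding z to the weights of a 3x3 convolution."""

    def __init__(self, inv_dim: int, in_ch: int, out_ch: int, hidden: int = 128):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.mlp = nn.Sequential(
            nn.Linear(inv_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_ch * in_ch * 3 * 3),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (inv_dim,) -> conv weight of shape (out_ch, in_ch, 3, 3)
        return self.mlp(z).view(self.out_ch, self.in_ch, 3, 3)


class HyperBottleneck(nn.Module):
    """Bottleneck (1x1 -> 3x3 -> 1x1) with a hypernetwork-generated 3x3 conv."""

    def __init__(self, in_ch: int, mid_ch: int, inv_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.hyper3x3 = Conv3x3HyperNet(inv_dim, mid_ch, mid_ch)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, in_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(in_ch)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        w = self.hyper3x3(z)                       # weights conditioned on invariances
        out = F.relu(self.bn2(F.conv2d(out, w, padding=1)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)                     # identity residual connection


# Example: one block conditioned on an 8-dim invariance descriptor.
block = HyperBottleneck(in_ch=256, mid_ch=64, inv_dim=8)
x = torch.randn(2, 256, 32, 32)
z = torch.rand(8)                                  # e.g. per-augmentation invariance strengths
y = block(x, z)                                    # -> (2, 256, 32, 32)
```

The design choice this illustrates is amortisation: a single set of backbone parameters plus a hypernetwork can emit features tailored to different invariance requirements at inference time, simply by changing `z`.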
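The quoted pre-training protocol (AdamW, 40-epoch linear warm-up, cosine decay over 300 epochs, batch size 1024) can be expressed with a standard PyTorch scheduler. This is a minimal sketch under assumed hyperparameters; the placeholder model, base learning rate, and weight decay are not taken from the paper.

```python
# Sketch of the quoted optimization protocol: AdamW with 40-epoch linear warm-up
# followed by cosine decay over the remaining pre-training epochs.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(128, 10)                   # placeholder for the actual backbone
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # assumed values

warmup_epochs, total_epochs = 40, 300

def lr_lambda(epoch: int) -> float:
    if epoch < warmup_epochs:                      # linear warm-up
        return (epoch + 1) / warmup_epochs
    # cosine decay from 1.0 down to 0.0 over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... one pre-training epoch over the dataset at batch size 1024 ...
    scheduler.step()
```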