Weighted Ensemble Self-Supervised Learning
Authors: Yangjun Ruan, Saurabh Singh, Warren Richard Morningstar, Alexander A Alemi, Sergey Ioffe, Ian Fischer, Joshua V. Dillon
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our method is demonstrated with two state-of-the-art SSL methods, DINO (Caron et al., 2021) and MSN (Assran et al., 2022). Our method outperforms both in multiple evaluation metrics on ImageNet-1K, particularly in the few-shot setting. Thorough experiments yield improved prior art baselines which our method still surpasses. |
| Researcher Affiliation | Collaboration | Yangjun Ruan, Saurabh Singh, Warren Morningstar, Alexander A. Alemi, Sergey Ioffe, Ian Fischer, Joshua V. Dillon; Google Research; University of Toronto & Vector Institute. Work done as a student researcher at Google. |
| Pseudocode | Yes | See Appx. A for pseudocode: Algorithm 1 (pseudocode for computing the ensemble loss) and Algorithm 2 (pseudocode for ensemble heads with simplified DINO). A hedged sketch of such an ensemble loss is given below the table. |
| Open Source Code | No | The paper mentions 'DINO’s publicly-available pretrained weights' and references 'the official DINO implementation' and 'public MSN code' (with URLs in footnotes). However, it does not explicitly state that the authors have released the source code for their *own* proposed methodology or provide a link to it. |
| Open Datasets | Yes | We experimented with DINO (Caron et al., 2021) and MSN (Assran et al., 2022) models on the ImageNet ILSVRC-2012 dataset (Deng et al., 2009). ... We used the 1-/2-/5-shot ImageNet dataset splits in Assran et al. (2022) and 1% (≈13-shot) ImageNet dataset splits. |
| Dataset Splits | Yes | For 1-/2-/5-shot evaluation results, we report the mean accuracy and standard deviation across 3 random splits of the data following Assran et al. (2022). ... For all few-shot evaluations, we searched the L2 regularization strength over {1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3, 10}. A sketch of such a regularization sweep is given below the table. |
| Hardware Specification | Yes | We benchmarked the wall-clock time and peak memory on 128 TPUv3 cores. |
| Software Dependencies | No | The paper mentions using 'JAX', 'AdamW optimizer', 'scikit-learn package', and 'tensorflow-datasets (tfds)' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In particular, all models were trained with the AdamW optimizer (Loshchilov & Hutter, 2018) and a batch size of 1024. The learning rate was linearly warmed up to 0.002 (= 0.001 × batch size / 512) and followed a cosine decay schedule. The weight decay followed a cosine schedule from 0.04 to 0.4. The momentum rate for the teacher was increased from 0.996 to 1 with a cosine schedule following BYOL (Grill et al., 2020). A stochastic depth (Huang et al., 2016) of 0.1 was applied without dropout (Srivastava et al., 2014). The student temperature τ is set to 0.1. ... We used a 3-layer projection head with a hidden dimension of 1024. ... Tables 7 and 8 provide detailed hyper-parameters for training. A sketch of these schedules is given below the table. |
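
The Pseudocode row points to Algorithm 1 (computing the ensemble loss) in the paper's Appendix A. The sketch below is not that algorithm; it is a minimal, hypothetical reading of what an ensemble loss over multiple projection heads could look like, assuming a uniform average of per-head DINO-style cross-entropy terms. The function names, the teacher temperature of 0.04, and the uniform averaging are assumptions; only the student temperature of 0.1 comes from the quoted setup.

```python
# Hypothetical sketch: average a DINO-style cross-entropy over E projection heads.
# This is NOT the paper's Algorithm 1; it only illustrates the general shape of an
# ensemble loss under the stated assumptions.
import jax
import jax.numpy as jnp


def dino_cross_entropy(t_logits, s_logits, t_temp=0.04, s_temp=0.1):
    """Cross-entropy between sharpened teacher targets and student log-probs."""
    targets = jax.nn.softmax(t_logits / t_temp, axis=-1)        # [batch, dim]
    log_probs = jax.nn.log_softmax(s_logits / s_temp, axis=-1)  # [batch, dim]
    return -jnp.sum(targets * log_probs, axis=-1).mean()


def ensemble_loss(teacher_logits, student_logits):
    """Uniformly average the per-head loss; inputs are [num_heads, batch, dim]."""
    per_head = jax.vmap(dino_cross_entropy)(teacher_logits, student_logits)
    return per_head.mean()
```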
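
For the Dataset Splits row, the quoted protocol is a grid search over L2 regularization strengths for few-shot probes on frozen features. Below is a hedged sketch of such a sweep with scikit-learn's LogisticRegression (the package named in the Software Dependencies row). The mapping of the paper's "regularization strength" onto scikit-learn's inverse-regularization parameter C (here C = 1/strength), and the feature/label arrays, are assumptions.

```python
# Hedged sketch of a few-shot L2-regularization sweep on frozen features.
# The C = 1/strength mapping and the evaluation split are assumptions; the paper
# reports mean/std over 3 random few-shot splits.
import numpy as np
from sklearn.linear_model import LogisticRegression

L2_STRENGTHS = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3, 10]


def sweep_l2(train_feats, train_labels, val_feats, val_labels):
    """Return (best_strength, best_accuracy) over the grid quoted above."""
    best_strength, best_acc = None, -np.inf
    for strength in L2_STRENGTHS:
        clf = LogisticRegression(C=1.0 / strength, max_iter=1000)
        clf.fit(train_feats, train_labels)
        acc = clf.score(val_feats, val_labels)
        if acc > best_acc:
            best_strength, best_acc = strength, acc
    return best_strength, best_acc
```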
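
Finally, the Experiment Setup row lists concrete optimizer and schedule values. The snippet below is a minimal sketch of those schedules in optax; the paper only states JAX, so optax itself, the step counts, and the use of inject_hyperparams to drive a scheduled weight decay are assumptions. The numeric values are the ones quoted in the row.

```python
# Hedged sketch of the quoted schedules: linear warmup + cosine decay for the
# learning rate, an increasing cosine schedule for weight decay (0.04 -> 0.4),
# and an increasing cosine schedule for the teacher EMA momentum (0.996 -> 1.0).
# total_steps / warmup_steps are placeholders, not values from the paper.
import jax.numpy as jnp
import optax

batch_size = 1024
total_steps = 125_000   # placeholder
warmup_steps = 12_500   # placeholder

# Peak LR of 0.002 = 0.001 * batch_size / 512; linear warmup, then cosine decay.
lr_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=0.001 * batch_size / 512,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
)

def _increasing_cosine(start, end, step):
    """Cosine ramp from `start` at step 0 to `end` at total_steps."""
    progress = jnp.clip(step / total_steps, 0.0, 1.0)
    return end + (start - end) * 0.5 * (1.0 + jnp.cos(jnp.pi * progress))

# Weight decay rises from 0.04 to 0.4 over training.
wd_schedule = lambda step: _increasing_cosine(0.04, 0.4, step)

# Teacher EMA momentum rises from 0.996 to 1.0 (BYOL-style).
momentum_schedule = lambda step: _increasing_cosine(0.996, 1.0, step)

# inject_hyperparams lets both the learning rate and the weight decay follow schedules.
optimizer = optax.inject_hyperparams(optax.adamw)(
    learning_rate=lr_schedule, weight_decay=wd_schedule)
```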