Weighted Ensemble Self-Supervised Learning

Authors: Yangjun Ruan, Saurabh Singh, Warren Richard Morningstar, Alexander A Alemi, Sergey Ioffe, Ian Fischer, Joshua V. Dillon

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of our method is demonstrated with two state-of-the-art SSL methods, DINO (Caron et al., 2021) and MSN (Assran et al., 2022). Our method outperforms both in multiple evaluation metrics on ImageNet-1K, particularly in the few-shot setting. Thorough experiments yield improved prior-art baselines, which our method still surpasses.
Researcher Affiliation | Collaboration | Yangjun Ruan, Saurabh Singh, Warren Morningstar, Alexander A. Alemi, Sergey Ioffe, Ian Fischer, Joshua V. Dillon; Google Research; University of Toronto & Vector Institute (work done as a student researcher at Google).
Pseudocode | Yes | See Appx. A for pseudocode: Algorithm 1 (pseudocode for computing the ensemble loss) and Algorithm 2 (pseudocode for ensemble heads with simplified DINO). A minimal loss sketch is given after this table.
Open Source Code | No | The paper mentions 'DINO's publicly-available pretrained weights' and references 'the official DINO implementation' and 'public MSN code' (with URLs in footnotes), but it does not state that the authors released the source code for their own proposed method or provide a link to it.
Open Datasets | Yes | We experimented with DINO (Caron et al., 2021) and MSN (Assran et al., 2022) models on the ImageNet ILSVRC-2012 dataset (Deng et al., 2009). ... We used the 1-/2-/5-shot ImageNet dataset splits in Assran et al. (2022) and the 1% (∼13-shot) ImageNet dataset splits.
Dataset Splits | Yes | For 1-/2-/5-shot evaluation results, we report the mean accuracy and standard deviation across 3 random splits of the data, following Assran et al. (2022). ... For all few-shot evaluations, we searched the L2 regularization strength over {1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3, 10}. A sketch of this evaluation protocol is given after this table.
Hardware Specification | Yes | We benchmarked the wall-clock time and peak memory on 128 TPUv3 cores.
Software Dependencies | No | The paper mentions using 'JAX', the 'AdamW optimizer', the 'scikit-learn package', and 'tensorflow-datasets (tfds)', but does not provide specific version numbers for these software components.
Experiment Setup | Yes | In particular, all models were trained with the AdamW optimizer (Loshchilov & Hutter, 2018) and a batch size of 1024. The learning rate was linearly warmed up to 0.002 (= 0.001 × batch size / 512) and then followed a cosine decay schedule. The weight decay followed a cosine schedule from 0.04 to 0.4. The momentum rate for the teacher was increased from 0.996 to 1 with a cosine schedule, following BYOL (Grill et al., 2020). A stochastic depth (Huang et al., 2016) of 0.1 was applied, without dropout (Srivastava et al., 2014). The student temperature τ is set to 0.1. ... We used a 3-layer projection head with a hidden dimension of 1024. ... Tables 7 and 8 provide detailed hyper-parameters for training. A sketch of these schedules is given after this table.
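
The loss sketch below is only a minimal illustration of the kind of computation the paper's Algorithm 1 names: a DINO-style cross-entropy between teacher and student outputs, averaged over an ensemble of projection heads with per-head weights. The array shapes, temperatures, and the way head weights are supplied are assumptions for illustration; they are not the paper's exact weighting scheme.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_ensemble_loss(student_logits, teacher_logits, head_weights,
                           student_temp=0.1, teacher_temp=0.04):
    # student_logits, teacher_logits: [num_heads, batch, num_prototypes]
    # head_weights: [num_heads], non-negative and summing to 1 (illustrative choice).
    teacher_probs = softmax(teacher_logits / teacher_temp)        # teacher targets (no gradient in practice)
    student_logp = np.log(softmax(student_logits / student_temp) + 1e-12)
    per_head = -(teacher_probs * student_logp).sum(-1).mean(-1)   # cross-entropy per head: [num_heads]
    return float((head_weights * per_head).sum())                 # weighted average over heads

# Toy usage with random logits and uniform head weights.
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 8, 16))
t = rng.normal(size=(4, 8, 16))
print(weighted_ensemble_loss(s, t, np.ones(4) / 4))
```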
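For the few-shot evaluation protocol quoted in the Dataset Splits row (mean accuracy over 3 random splits, with a sweep over L2 regularization strengths), a sketch using the scikit-learn package the authors mention could look like the following. Treating the quoted strength as a penalty λ and passing C = 1/λ to LogisticRegression is our assumption, and feature extraction and split loading are omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# L2 regularization strengths quoted in the Dataset Splits row.
L2_GRID = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3, 10]

def few_shot_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear probe on frozen features for each L2 strength; return the best accuracy."""
    accs = []
    for lam in L2_GRID:
        clf = LogisticRegression(C=1.0 / lam, max_iter=1000)  # C = 1/lambda is our assumption
        clf.fit(train_feats, train_labels)
        accs.append(clf.score(test_feats, test_labels))
    return max(accs)

def mean_std_over_splits(splits):
    """splits: list of (train_feats, train_labels, test_feats, test_labels), e.g. 3 random splits."""
    accs = [few_shot_accuracy(*s) for s in splits]
    return np.mean(accs), np.std(accs)
```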
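The schedules quoted in the Experiment Setup row (linear warmup of the learning rate to 0.002 = 0.001 × batch size / 512 followed by cosine decay, weight decay ramped from 0.04 to 0.4, teacher momentum increased from 0.996 to 1) can be sketched as below. The total and warmup step counts are placeholders, not values taken from the paper.

```python
import numpy as np

def cosine_schedule(start, end, total_steps, warmup_steps=0):
    """Linear warmup from 0 to `start`, then a cosine ramp from `start` to `end`."""
    steps = np.arange(total_steps)
    warmup = start * steps / max(warmup_steps, 1)
    t = np.clip((steps - warmup_steps) / max(total_steps - warmup_steps, 1), 0.0, 1.0)
    cosine = end + 0.5 * (start - end) * (1.0 + np.cos(np.pi * t))
    return np.where(steps < warmup_steps, warmup, cosine)

batch_size = 1024
total_steps, warmup_steps = 100_000, 10_000                            # placeholder step counts
lr = cosine_schedule(0.001 * batch_size / 512, 0.0, total_steps,
                     warmup_steps)                                     # peak LR 0.002, decays to 0
weight_decay = cosine_schedule(0.04, 0.4, total_steps)                 # increases over training
teacher_momentum = cosine_schedule(0.996, 1.0, total_steps)            # EMA rate increases to 1
```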