Self-Supervised Learning with Kernel Dependence Maximization

Authors: Yazhe Li, Roman Pogodin, Danica J. Sutherland, Arthur Gretton

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our experimental setup, where we assess the performance of the representation learned with SSL-HSIC both with and without a target network. For evaluation, we retain the backbone as a feature extractor for downstream tasks. We evaluate the representation on various downstream tasks including classification, object segmentation, object detection and depth estimation.
Researcher Affiliation | Collaboration | Yazhe Li (DeepMind and Gatsby Unit, UCL) yazhe@google.com; Roman Pogodin (Gatsby Unit, UCL) roman.pogodin.17@ucl.ac.uk; Danica J. Sutherland (UBC and Amii) dsuth@cs.ubc.ca; Arthur Gretton (Gatsby Unit, UCL) arthur.gretton@gmail.com
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code is available at https://github.com/deepmind/ssl_hsic.
Open Datasets | Yes | For evaluation, we retain the backbone as a feature extractor for downstream tasks. We evaluate the representation on various downstream tasks including classification, object segmentation, object detection and depth estimation.
Dataset Splits | Yes | Table 1 reports the top-1 and top-5 accuracies obtained with SSL-HSIC on the ImageNet validation set, and compares to previous self-supervised learning methods.
Hardware Specification | Yes | We train the model with a batch size of 4096 on 128 Cloud TPU v4 cores.
Software Dependencies | No | The paper mentions the LARS optimizer but does not specify software names or version numbers needed for reproducibility. (A hedged optax sketch follows the table.)
Experiment Setup | Yes | The output of the encoder is a 2048-dimensional embedding vector, which is the representation used for downstream tasks. As in BYOL [25], our projector g and predictor q networks are 2-layer MLPs with 4096 hidden dimensions and 256 output dimensions. The outputs of the networks are batch-normalized and rescaled to unit norm before computing the loss. We use an inverse multiquadric kernel (IMQ) for the latent representation (approximated with 512 random Fourier features that are resampled at each step; see Appendix C for details) and a linear kernel for labels. γ in (4) is set to 3.
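
To make the quoted experiment setup concrete, below is a minimal JAX sketch of the projector/predictor design it describes: 2-layer MLPs with 4096 hidden units and 256 outputs, whose outputs are batch-normalized and rescaled to unit norm, plus the inverse multiquadric (IMQ) kernel on the latents. This is an illustration rather than the authors' implementation: the kernel is evaluated in closed form instead of the 512 random Fourier features used in the paper, and the kernel scale c, the initialization, and the stateless batch-norm stand-in are all assumptions.

```python
# Minimal JAX sketch of the projector/predictor MLP and the IMQ kernel
# described in the Experiment Setup row. Illustrative only, not the
# authors' code; the kernel scale `c` and initialization are assumptions.
import jax
import jax.numpy as jnp


def init_mlp(key, in_dim=2048, hidden=4096, out=256):
    """Initialise a 2-layer MLP (the projector g or predictor q)."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) / jnp.sqrt(in_dim),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, out)) / jnp.sqrt(hidden),
        "b2": jnp.zeros(out),
    }


def mlp(params, x):
    """2-layer MLP with ReLU, as in the BYOL-style projector/predictor."""
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


def normalize_output(z, eps=1e-6):
    """Batch-normalise then rescale to unit norm, as stated in the paper.
    (A stateless whitening stand-in for batch norm; the real model keeps
    running statistics.)"""
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)
    return z / (jnp.linalg.norm(z, axis=-1, keepdims=True) + eps)


def imq_kernel(z1, z2, c=1.0):
    """Exact IMQ kernel k(x, y) = c / sqrt(c^2 + ||x - y||^2). The paper
    instead approximates it with 512 random Fourier features that are
    resampled at every step (Appendix C)."""
    sq_dists = jnp.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return c / jnp.sqrt(c ** 2 + sq_dists)


# Example: embed a small batch of 2048-d backbone features.
key = jax.random.PRNGKey(0)
feats = jax.random.normal(key, (8, 2048))   # backbone output
g = init_mlp(jax.random.PRNGKey(1))         # projector
z = normalize_output(mlp(g, feats))         # 256-d, unit-norm latents
K = imq_kernel(z, z)                        # kernel matrix on latents
```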
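
Regarding the Software Dependencies row: the paper names the LARS optimizer but no library or version. One plausible way to instantiate LARS in a JAX stack is via optax's LARS alias, sketched below; the learning rate, weight decay, and toy parameters are placeholders rather than values reported in the paper.

```python
# Hedged sketch: instantiating a LARS optimizer with optax in JAX.
# Hyperparameters below are placeholders, not the paper's settings.
import jax
import jax.numpy as jnp
import optax

params = {"w": jnp.zeros((2048, 256))}            # toy parameter tree
optimizer = optax.lars(learning_rate=0.2, weight_decay=1e-6)
opt_state = optimizer.init(params)

def loss_fn(p):
    return jnp.sum(p["w"] ** 2)                   # stand-in loss

grads = jax.grad(loss_fn)(params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```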