The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning
Authors: Zixin Wen, Yuanzhi Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. |
| Researcher Affiliation | Academia | Zixin Wen, Machine Learning Department, Carnegie Mellon University, zixinw@andrew.cmu.edu; Yuanzhi Li, Machine Learning Department, Carnegie Mellon University, yuanzhil@andrew.cmu.edu |
| Pseudocode | No | The paper describes algorithms and mathematical formulations in prose and equations, but it does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | Our results provide a completely different perspective compared to them: We explain why training the prediction head can encourage the network to learn diversified features and avoid dimensional collapses, even when the trivial collapsed optima still exist in the training objective, which is not covered by the prior works, as shall be discussed below. From Section 1.1 (Comparison to Similar Studies): In this section, we will clarify the differences between our results and some similar studies. We point out that all the claims below are derived only in our theoretical setting and are partially verified in experiments over datasets such as CIFAR-10, CIFAR-100, and STL-10. |
| Dataset Splits | No | The paper states in its ethics checklist that training details and data splits were specified, but the main text does not provide explicit percentages or sample counts for training, validation, or test splits. It only mentions sampling data points and data augmentation. |
| Hardware Specification | No | The ethics checklist for question 3(d) explicitly states: "Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]" |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | Initialization and hyper-parameters. At $t = 0$, we initialize $W$ and $E$ as $W^{(0)}_{i,j} \sim \mathcal{N}(0, 1/d)$ and $E^{(0)} = I_m$, and we only train the off-diagonal entries of $E^{(t)}$. For the simplicity of analysis, we let $m = 2$, which suffices to illustrate our main message. For the learning rates, we let $\eta \in (0, 1/\mathrm{poly}(d)]$ be sufficiently small and $\eta_E \in [\eta/\alpha_1^{O(1)}, \eta/\mathrm{polylog}(d)]$, which is smaller than $\eta$. |
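The setup quoted in the Experiment Setup row (identity-initialized prediction head with only its off-diagonal entries trainable, Gaussian-initialized $W$ with variance $1/d$, and a smaller learning rate for the head) can be made concrete with a short sketch. The snippet below is a minimal PyTorch illustration under our own assumptions, not the authors' released code: the dimensions `d` and `m`, the learning rates `lr_w` and `lr_e` (standing in for $\eta$ and $\eta_E$), and the stand-in loss are illustrative placeholders; the paper's actual training objective differs.

```python
import torch
import torch.nn.functional as F

d, m = 64, 2                # illustrative dimensions; the paper sets m = 2
lr_w, lr_e = 1e-3, 1e-4     # placeholders for eta and eta_E (eta_E smaller than eta)

# W^{(0)}_{i,j} ~ N(0, 1/d): entrywise Gaussian with variance 1/d
W = (torch.randn(m, d) / d ** 0.5).requires_grad_(True)

# E^{(0)} = I_m; a gradient hook zeroes the diagonal gradients so that only
# the off-diagonal entries of E receive updates
E = torch.eye(m, requires_grad=True)
off_diag_mask = 1.0 - torch.eye(m)
E.register_hook(lambda grad: grad * off_diag_mask)

# Separate learning rates for W and the prediction head E
optimizer = torch.optim.SGD([
    {"params": [W], "lr": lr_w},
    {"params": [E], "lr": lr_e},
])

# One illustrative update on a stand-in loss over two dummy augmented views;
# the paper's non-contrastive objective is different from this placeholder.
x1, x2 = torch.randn(8, d), torch.randn(8, d)
z1, z2 = x1 @ W.T, x2 @ W.T
loss = -F.cosine_similarity(z1 @ E.T, z2.detach(), dim=1).mean()
loss.backward()
optimizer.step()
```

Masking the gradient of $E$'s diagonal is only one way to keep those entries fixed under plain SGD; an equivalent choice would be to parameterize $E = I_m + A$ and zero the diagonal of $A$ after each step.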