Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

Authors: Zixin Wen, Yuanzhi Li

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective.
Researcher Affiliation | Academia | Zixin Wen, Machine Learning Department, Carnegie Mellon University, EMAIL; Yuanzhi Li, Machine Learning Department, Carnegie Mellon University, EMAIL
Pseudocode | No | The paper describes algorithms and mathematical formulations in prose and equations, but it does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | Our results provide a completely different perspective compared to them: We explain why training the prediction head can encourage the network to learn diversified features and avoid dimensional collapses, even when the trivial collapsed optima still exist in the training objective, which is not covered by the prior works, as shall be discussed below. 1.1 Comparison to Similar Studies. In this section, we will clarify the differences between our results and some similar studies. We point out that all the claims below are derived only in our theoretical setting and are partially verified in experiments over datasets such as CIFAR-10, CIFAR-100, and STL-10.
Dataset Splits | No | The paper states in its ethics checklist that training details and data splits were specified, but the main text does not provide explicit percentages or sample counts for training, validation, or test splits. It only mentions sampling data points and data augmentation.
Hardware Specification | No | The ethics checklist for question 3(d) explicitly states: "Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]"
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | Initialization and hyper-parameters. At t = 0, we initialize W and E as W^(0)_{i,j} ~ N(0, 1/d) and E^(0) = I_m, and we only train the off-diagonal entries of E^(t). For the simplicity of analysis, we let m = 2, which suffices to illustrate our main message. For the learning rates, we let η ∈ (0, 1/poly(d)] be sufficiently small and η_E ∈ [η/α_1^{O(1)}, η/polylog(d)], which is smaller than η.
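The initialization scheme quoted in the Experiment Setup row (Gaussian W, identity prediction head E with only its off-diagonal entries trainable) can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; the input dimension d, the learning rate, and the `step_E` helper are hypothetical choices for demonstration, while m = 2 follows the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 2  # d is a hypothetical input dimension; the paper's analysis takes m = 2

# W^(0)_{i,j} ~ N(0, 1/d): i.i.d. Gaussian entries with variance 1/d
W = rng.normal(0.0, np.sqrt(1.0 / d), size=(m, d))

# Prediction head E^(0) = I_m; only off-diagonal entries are trainable
E = np.eye(m)
off_diag = 1.0 - np.eye(m)  # mask that is 0 on the diagonal, 1 elsewhere

def step_E(E, grad_E, lr):
    """Gradient step on E that masks out the diagonal,
    so the diagonal entries stay fixed at their initial value 1."""
    return E - lr * (grad_E * off_diag)

# Example: one step with an arbitrary gradient (hypothetical); the diagonal
# of E is unchanged, only the off-diagonal entries move.
grad = rng.normal(size=(m, m))
E = step_E(E, grad, lr=0.01)
```

Masking the gradient, rather than re-parameterizing E, is just one simple way to keep the diagonal frozen during training.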