The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning
Authors: Zixin Wen, Yuanzhi Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. |
| Researcher Affiliation | Academia | Zixin Wen, Machine Learning Department, Carnegie Mellon University, zixinw@andrew.cmu.edu; Yuanzhi Li, Machine Learning Department, Carnegie Mellon University, yuanzhil@andrew.cmu.edu |
| Pseudocode | No | The paper describes algorithms and mathematical formulations in prose and equations, but it does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | Our results provide a completely different perspective compared to them: We explain why training the prediction head can encourage the network to learn diversified features and avoid dimensional collapses, even when the trivial collapsed optima still exist in the training objective, which is not covered by the prior works, as shall be discussed below. From Section 1.1 (Comparison to Similar Studies): In this section, we will clarify the differences between our results and some similar studies. We point out that all the claims below are derived only in our theoretical setting and are partially verified in experiments over datasets such as CIFAR-10, CIFAR-100, and STL-10. |
| Dataset Splits | No | The paper states in its ethics checklist that training details and data splits were specified, but the main text does not provide explicit percentages or sample counts for training, validation, or test splits. It only mentions sampling data points and data augmentation. |
| Hardware Specification | No | The ethics checklist for question 3(d) explicitly states: "Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]" |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | Initialization and hyper-parameters. At $t = 0$, we initialize $W$ and $E$ as $W^{(0)}_{i,j} \sim \mathcal{N}(0, 1/d)$ and $E^{(0)} = I_m$, and we only train the off-diagonal entries of $E^{(t)}$. For the simplicity of analysis, we let $m = 2$, which suffices to illustrate our main message. For the learning rates, we let $\eta \in (0, 1/\mathrm{poly}(d)]$ be sufficiently small and $\eta_E \in [\eta/\alpha_1^{O(1)}, \eta/\mathrm{polylog}(d)]$, which is smaller than $\eta$. |
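The setup quoted in the Experiment Setup row (identity-initialized prediction head with only its off-diagonal entries trainable, Gaussian-initialized $W$ with variance $1/d$, and a smaller learning rate for the head) can be made concrete with a short sketch. The snippet below is a minimal PyTorch illustration under our own assumptions, not the authors' released code: the dimensions `d` and `m`, the learning rates `lr_w` and `lr_e` (standing in for $\eta$ and $\eta_E$), and the stand-in loss are illustrative placeholders; the paper's actual training objective differs.

```python
import torch
import torch.nn.functional as F

d, m = 64, 2                # illustrative dimensions; the paper sets m = 2
lr_w, lr_e = 1e-3, 1e-4     # placeholders for eta and eta_E (eta_E smaller than eta)

# W^{(0)}_{i,j} ~ N(0, 1/d): entrywise Gaussian with variance 1/d
W = (torch.randn(m, d) / d ** 0.5).requires_grad_(True)

# E^{(0)} = I_m; a gradient hook zeroes the diagonal gradients so that only
# the off-diagonal entries of E receive updates
E = torch.eye(m, requires_grad=True)
off_diag_mask = 1.0 - torch.eye(m)
E.register_hook(lambda grad: grad * off_diag_mask)

# Separate learning rates for W and the prediction head E
optimizer = torch.optim.SGD([
    {"params": [W], "lr": lr_w},
    {"params": [E], "lr": lr_e},
])

# One illustrative update on a stand-in loss over two dummy augmented views;
# the paper's non-contrastive objective is different from this placeholder.
x1, x2 = torch.randn(8, d), torch.randn(8, d)
z1, z2 = x1 @ W.T, x2 @ W.T
loss = -F.cosine_similarity(z1 @ E.T, z2.detach(), dim=1).mean()
loss.backward()
optimizer.step()
```

Masking the gradient of $E$'s diagonal is only one way to keep those entries fixed under plain SGD; an equivalent choice would be to parameterize $E = I_m + A$ and zero the diagonal of $A$ after each step.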