Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning

Authors: Yuandong Tian

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulation verifies our theoretical findings. Preliminary experiments on simulated data verify our findings. Finally, we also characterize the interaction between layers in the 2-layer network: while many patterns exist within each receptive field, those contributing to global patterns are prioritized for learning by the training dynamics. This global modulation changes the eigenstructure of the low-level covariance matrix so that relevant patterns are learned with higher probability. We train a 2-layer network on this dataset. The 2-layer network has K = 10 disjoint RFs; within each RF there are M = βP filters, where β ≥ 1 is a hyper-parameter that controls the degree of over-parameterization. The network is trained with the InfoNCE loss and SGD with learning rate 2 × 10^-3, momentum 0.9, and weight decay 5 × 10^-3 for 5000 minibatches with batch size 128. Code is in PyTorch, runnable on a single modern GPU. (A hedged PyTorch sketch of this 2-layer setup is given after the table.)
Researcher Affiliation | Industry | Yuandong Tian, Meta AI (FAIR), yuandong@meta.com
Pseudocode | No | The paper does not include pseudocode or an algorithm block; it describes its methods in prose and mathematical equations.
Open Source Code | No | The paper states "Code is in PyTorch runnable on a single modern GPU," but it does not provide a link to a public repository, nor does it state that the code will be released or included in supplementary materials. It only implies that the code exists and runs on a GPU, not that it is publicly accessible.
Open Datasets | Yes | We also run the same 2-layer network (as in Sec. E.2) on the MNIST (Deng, 2012) dataset. Its training set consists of 50,000 images, each of size 28 by 28. We split the 28-by-28 images into 4 disjoint receptive fields, each of size 14 by 14, as in Fig. 2(b). In each region, we vectorize the receptive field into a 14 × 14 = 196-dimensional vector. (A sketch of this receptive-field preprocessing is given after the table.)
Dataset Splits | No | The paper mentions a batch size of 128 for training and the number of MNIST training images (50,000), but does not specify training, validation, or test split percentages or counts. It only mentions using a smaller batch size of 8 for MNIST because of the number of classes.
Hardware Specification | No | The paper states "Code is in PyTorch runnable on a single modern GPU." This is a general statement and does not specify the GPU model (e.g., NVIDIA A100, RTX 3090), CPU, or other hardware used for the experiments.
Software Dependencies | No | The paper mentions PyTorch as the framework: "Code is in PyTorch runnable on a single modern GPU." However, it does not specify a PyTorch version or any other software dependencies, such as the Python version or additional libraries.
Experiment Setup | Yes | The network is trained with the InfoNCE loss and SGD with learning rate 2 × 10^-3, momentum 0.9, and weight decay 5 × 10^-3 for 5000 minibatches with batch size 128. Code is in PyTorch, runnable on a single modern GPU. (A minimal InfoNCE loss sketch is given after the table.)
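
To make the setup quoted in the Research Type row concrete, here is a minimal PyTorch sketch of a 2-layer network with K = 10 disjoint receptive fields and M = βP filters per field, trained with the quoted SGD hyper-parameters. This is not the authors' released code; the per-RF input dimension, the ReLU nonlinearity, the embedding size, and β = 2 are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the 2-layer network described in
# the Research Type row: K = 10 disjoint receptive fields (RFs), each with
# M = beta * P filters, trained with the quoted SGD hyper-parameters.
# rf_dim, P, beta, out_dim, and the ReLU nonlinearity are assumptions.
import torch
import torch.nn as nn

class TwoLayerRFNet(nn.Module):
    def __init__(self, K=10, rf_dim=20, P=10, beta=2, out_dim=64):
        super().__init__()
        M = beta * P  # beta >= 1 controls the degree of over-parameterization
        # Lower layer: one linear filter bank per disjoint RF, followed by ReLU.
        self.rf_filters = nn.ModuleList([nn.Linear(rf_dim, M) for _ in range(K)])
        # Upper layer: mixes the K * M hidden activations into the embedding.
        self.top = nn.Linear(K * M, out_dim)

    def forward(self, x):
        # x: (batch, K, rf_dim), one slice per disjoint receptive field.
        h = [torch.relu(f(x[:, k])) for k, f in enumerate(self.rf_filters)]
        return self.top(torch.cat(h, dim=1))

model = TwoLayerRFNet()
opt = torch.optim.SGD(model.parameters(), lr=2e-3, momentum=0.9, weight_decay=5e-3)
z = model(torch.randn(128, 10, 20))   # one minibatch of 128 samples
```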
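
The receptive-field preprocessing described in the Open Datasets row can be sketched as follows. The paper does not specify how the MNIST images are loaded (torchvision.datasets.MNIST would be one option), so a random stand-in image is used here.

```python
# Sketch of the quoted preprocessing: each 28x28 MNIST image is split into
# 4 disjoint 14x14 receptive fields, and each field is vectorized into a
# 196-dimensional vector. The stand-in image below replaces actual MNIST data.
import torch

def split_into_receptive_fields(img: torch.Tensor) -> torch.Tensor:
    # img: (28, 28) image tensor -> (4, 196): one flattened vector per quadrant.
    quadrants = [img[:14, :14], img[:14, 14:], img[14:, :14], img[14:, 14:]]
    return torch.stack([q.reshape(-1) for q in quadrants])

img = torch.rand(28, 28)                  # stand-in for one MNIST image
rfs = split_into_receptive_fields(img)    # shape: (4, 196)
assert rfs.shape == (4, 14 * 14)
```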
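
Finally, a minimal InfoNCE loss in the form commonly used for contrastive training, matching the Experiment Setup row. The temperature value and the use of cosine similarity between two augmented views are assumptions; the paper's exact implementation is not quoted.

```python
# Minimal InfoNCE loss sketch; tau and cosine similarity are assumptions.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # z1, z2: (batch, dim) embeddings of two views of the same samples.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau              # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))      # positive pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

# Example use with the quoted optimizer settings (model and views are assumed):
# opt = torch.optim.SGD(model.parameters(), lr=2e-3, momentum=0.9, weight_decay=5e-3)
# loss = info_nce(model(view1), model(view2)); loss.backward(); opt.step()
```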