LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures
Authors: Vimal Thilak, Chen Huang, Omid Saremi, Laurent Dinh, Hanlin Goh, Preetum Nakkiran, Joshua M. Susskind, Etai Littwin
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that LiDAR significantly surpasses naive rank-based approaches in its predictive power of optimal hyperparameters. Our proposed criterion presents a more robust and intuitive means of assessing the quality of representations within JE architectures, which we hope facilitates broader adoption of these powerful techniques in various domains. |
| Researcher Affiliation | Industry | Vimal Thilak, Chen Huang, Omid Saremi, Laurent Dinh, Hanlin Goh, Preetum Nakkiran, Josh Susskind, Etai Littwin Apple Correspondence to: {vthilak, elittwin}@apple.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper refers to existing open-source implementations that were used (e.g., VISSL, VICReg's reference implementation, DINO's reference implementation, data2vec's reference implementation), but does not state that the code specific to this paper's methodology or experiments is open-sourced. |
| Open Datasets | Yes | We use the Imagenet-1k dataset (Russakovsky et al., 2015) for all experiments. We use the train split as the source dataset for pretraining and linear probing, and use the test split as the target dataset. |
| Dataset Splits | Yes | For each pretrained checkpoint, we train a linear probe on the train split, which we denote as the oracle, and record its test performance on the test split. ...The representations from the backbone are evaluated via standard linear probing by training a linear layer on the ImageNet-1k training split and calculating test accuracy on the validation split. |
| Hardware Specification | No | The paper mentions that "The feature extraction is done on one GPU while the metrics are implemented on the CPU" for runtime comparison, but it does not specify the model or detailed specifications of the GPU, CPU, or other hardware used for the main experiments. |
| Software Dependencies | No | The paper mentions various optimizers (Adam, LARS, SGD with Nesterov momentum) and refers to existing implementations for models (VICReg, DINO, SimCLR, data2vec, I-JEPA), but it does not specify version numbers for general software dependencies such as Python, PyTorch, TensorFlow, CUDA, or specific library versions. |
| Experiment Setup | Yes | We vary different hyperparameters per method. The varied hyperparameters range from optimization-related ones such as learning rate and weight decay, to architecture-specific hyperparameters such as softmax temperature, to data augmentation and masking-based hyperparameters. ...Self-supervised training is run for 600 epochs with an effective batch size of 2048... The probe is optimized with the Adam (Kingma & Ba, 2015) optimizer for 20 epochs with a starting learning rate of 0.01 and a step learning rate schedule where the base learning rate is dropped by a factor of 10 after 15 epochs. (A minimal probe-training sketch based on this setup follows the table.) |
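
The probe-training recipe quoted in the Experiment Setup row maps onto a short PyTorch loop. The sketch below is illustrative only, not the authors' code: the `backbone`, datasets, feature dimension, and probe batch size are assumptions, while the optimizer (Adam), 20-epoch budget, starting learning rate of 0.01, and 10x drop after 15 epochs follow the quoted description.

```python
# Minimal sketch (PyTorch assumed): train a linear probe on frozen backbone
# features with the schedule described above. Backbone, datasets, feat_dim,
# and the probe batch size of 256 are placeholders, not the paper's values.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_linear_probe(backbone, train_set, test_set, feat_dim, num_classes=1000):
    backbone.eval()                                   # frozen SSL encoder
    probe = nn.Linear(feat_dim, num_classes).cuda()

    # Adam, lr 0.01, step schedule: drop by 10x after 15 of 20 epochs.
    opt = torch.optim.Adam(probe.parameters(), lr=0.01)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=15, gamma=0.1)
    loss_fn = nn.CrossEntropyLoss()

    train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
    for epoch in range(20):
        for images, labels in train_loader:
            images, labels = images.cuda(), labels.cuda()
            with torch.no_grad():                     # backbone is not updated
                feats = backbone(images)
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()

    # Report top-1 accuracy on the held-out split.
    probe.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in DataLoader(test_set, batch_size=256):
            preds = probe(backbone(images.cuda())).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```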