Matrix Information Theory for Self-Supervised Learning

Authors: Yifan Zhang, Zhiquan Tan, Jingqin Yang, Weiran Huang, Yang Yuan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results reveal that Matrix-SSL outperforms state-of-the-art methods on the ImageNet dataset under linear evaluation settings and on MS-COCO for transfer learning tasks. Specifically, on MS-COCO transfer learning tasks, our method outperforms previous SOTA methods such as MoCo v2 and BYOL by up to 3.3% with only 400 epochs of pre-training, compared to their 800 epochs. We also introduce representation learning into the language modeling regime by fine-tuning a 7B model with matrix cross-entropy loss, achieving a margin of 3.1% over the standard cross-entropy loss on the GSM8K dataset.
Researcher Affiliation | Academia | 1 IIIS, Tsinghua University, Beijing, China; 2 Department of Mathematical Sciences, Tsinghua University, Beijing, China; 3 MIFA Lab, Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University, Shanghai, China; 4 Shanghai AI Laboratory, Shanghai, China; 5 Shanghai Qizhi Institute, Shanghai, China. Correspondence to: Yang Yuan <yuanyang@tsinghua.edu.cn>.
Pseudocode | Yes | Algorithm 1: PyTorch-style pseudo-code for Matrix-SSL. (A hedged re-implementation sketch of the Matrix-SSL objective appears after this table.)
Open Source Code | Yes | The code is available at https://github.com/yifanzhang-pro/Matrix-SSL.
Open Datasets | Yes | In this section, we implement our proposed Matrix-SSL method for self-supervised learning tasks on the ImageNet (Deng et al., 2009) dataset. ... we finetune the pre-trained models on MS-COCO (Lin et al., 2014) object detection and instance segmentation tasks. ... We evaluated the performance of different models on the mathematical reasoning dataset GSM8K (Cobbe et al., 2021) and the MATH dataset (Hendrycks et al., 2021) ... fine-tune it on the MetaMath dataset (Yu et al., 2023).
Dataset Splits | Yes | We follow the standard linear evaluation protocol (Chen et al., 2020a; Grill et al., 2020; Chen & He, 2021). ... Linear evaluation Top-1 accuracy when pre-trained for 100, 200, and 400 epochs on the ImageNet (Deng et al., 2009) dataset is shown in Table 1.
Hardware Specification | No | No specific hardware details such as GPU models (e.g., NVIDIA A100, RTX 2080 Ti) or CPU models were mentioned in the paper.
Software Dependencies | No | While "PyTorch-style pseudo-code" is mentioned for Algorithm 1, no specific version number for PyTorch or any other software dependency is provided.
Experiment Setup | Yes | For pre-training, we use the SGD optimizer with a 2048 batch size, 1e-5 weight decay, 0.9 momentum, and a 4.0 base learning rate, scheduled by a cosine decay learning rate scheduler (Loshchilov & Hutter, 2016), to optimize the online network over the training process. The momentum used for the exponential moving average is scheduled from 0.996 to 1 by another cosine scheduler. For linear evaluation, we use the LARS optimizer (You et al., 2017) with a 4096 batch size, 0.9 momentum, no weight decay, and a 0.03 base learning rate scheduled by a cosine decay learning rate scheduler, to train the linear layer over 100 epochs, and report the performance of the last epoch. (A hedged configuration sketch of the pre-training settings appears after this table.)
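The paper's Algorithm 1 (PyTorch-style pseudo-code for Matrix-SSL) is not reproduced in this report. Below is a minimal, hedged sketch of what such an objective could look like, assuming a matrix cross-entropy of the form MCE(P, Q) = tr(-P log Q + Q), a uniformity term that pulls the symmetrized cross-correlation of the two views toward (1/d) I, and an alignment term between the two branches' auto-correlation matrices. The gamma weight, the eigenvalue clamp, and the epsilon regularizers are illustrative choices, not the authors' exact formulation.

    # Hedged sketch of a Matrix-SSL-style objective (not the authors' reference code).
    import torch
    import torch.nn.functional as F

    def matrix_log(P: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """Principal matrix logarithm of a symmetric PSD matrix via eigendecomposition."""
        vals, vecs = torch.linalg.eigh(P)
        return vecs @ torch.diag(torch.log(vals.clamp_min(eps))) @ vecs.T

    def matrix_ce(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
        """Matrix cross-entropy MCE(P, Q) = tr(-P log Q + Q)."""
        return torch.trace(-P @ matrix_log(Q) + Q)

    def matrix_ssl_loss(z1: torch.Tensor, z2: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
        """z1, z2: (batch, dim) projector outputs for two augmented views of the same images."""
        b, d = z1.shape
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        reg = 1e-6 * torch.eye(d, device=z1.device)   # keeps the matrix logarithm well defined
        c12 = z1.T @ z2 / b                           # cross-correlation of the two views
        c11 = z1.T @ z1 / b                           # auto-correlation of view 1
        c22 = z2.T @ z2 / b                           # auto-correlation of view 2
        target = torch.eye(d, device=z1.device) / d   # "uniform" spectrum target
        uniformity = matrix_ce(target, (c12 + c12.T) / 2 + reg)
        alignment = matrix_ce(c11 + reg, c22 + reg)
        return uniformity + gamma * alignment

A typical use would compute z1 and z2 from two augmentations of the same batch and back-propagate the returned scalar, alongside the momentum-encoder update described in the Experiment Setup row.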
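The pre-training hyperparameters quoted above (SGD, batch size 2048, base learning rate 4.0, momentum 0.9, weight decay 1e-5, cosine learning rate decay, EMA momentum scheduled from 0.996 to 1) can be written down as a short configuration sketch. This is a hedged illustration only: the helper names below are hypothetical, the step counts are placeholders, and the cosine EMA schedule follows the common BYOL-style convention rather than the paper's exact code. The LARS optimizer used for linear evaluation is not part of core PyTorch and is therefore omitted here.

    # Hedged sketch of the pre-training optimizer and EMA schedule described above.
    import math
    import torch

    def build_pretraining_optimizer(model: torch.nn.Module, epochs: int, steps_per_epoch: int):
        """SGD with the quoted hyperparameters and a cosine-decayed learning rate."""
        optimizer = torch.optim.SGD(model.parameters(), lr=4.0, momentum=0.9, weight_decay=1e-5)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)
        return optimizer, scheduler

    def ema_momentum(step: int, total_steps: int, base: float = 0.996) -> float:
        """Cosine schedule for the target-network EMA momentum, increasing from `base` to 1."""
        return 1.0 - (1.0 - base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0

    @torch.no_grad()
    def ema_update(online: torch.nn.Module, target: torch.nn.Module, m: float) -> None:
        """Exponential moving average of the online network's parameters into the target network."""
        for p_online, p_target in zip(online.parameters(), target.parameters()):
            p_target.mul_(m).add_(p_online, alpha=1.0 - m)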