On the Comparison between Multi-modal and Single-modal Contrastive Learning

Authors: Wei Huang, Andi Han, Yongqiang Chen, Yuan Cao, Zhiqiang Xu, Taiji Suzuki

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments on both synthetic and real-world datasets further consolidate our theoretical findings.
Researcher Affiliation | Academia | Wei Huang (RIKEN AIP, wei.huang.vr@riken.jp); Andi Han (RIKEN AIP, andi.han@riken.jp); Yongqiang Chen (The Chinese University of Hong Kong, yqchen@cse.cuhk.edu.hk); Yuan Cao (The University of Hong Kong, yuancao@hku.hk); Zhiqiang Xu (MBZUAI, zhiqiang.xu@mbzuai.ac.ae); Taiji Suzuki (University of Tokyo & RIKEN AIP, taiji@mist.i.u-tokyo.ac.jp)
Pseudocode | No | The paper includes mathematical equations and derivations (e.g., in Sections 3.1, 3.2, and 5), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We have also uploaded the code in the supplementary material.
Open Datasets | Yes | Synthetic experiments: We conduct synthetic experiments to verify the theoretical results obtained in the previous sections. We generate samples following the theoretical setups, where we set the data dimension d = 2000, the number of training samples n = 100, the number of test samples n_test = 200, and the hidden size of all encoders as m = 50. [...] Real-world experiments: We now extend the comparison of single-modal and multi-modal learning to realistic image data, Colored MNIST [3, 54], a typical benchmark for studying generalization capability under distribution shifts.
Dataset Splits | No | The paper specifies training and test sets but does not mention a separate validation split for either the synthetic or the real-world experiments. For example, it states 'number of training samples n = 100, number of test samples n_test = 200' and describes a 'training set' and 'test set' for Colored MNIST, but no validation set.
Hardware Specification | Yes | We run all the experiments on Linux servers with NVIDIA V100 graphics cards and CUDA 11.2, completing them within one hour.
Software Dependencies | No | The paper states 'We implement our methods using PyTorch.' and mentions 'CUDA 11.2'. A CUDA version is given, but no PyTorch version is specified, so complete version information for this key software dependency is not provided.
Experiment Setup | Yes | We adopt gradient descent with a learning rate of 0.01 as the optimizer and train the model for 200 epochs. In the single-modal setting, µ is set to [5, 0, ..., 0]^T and ξ ∼ N(0, I) for the in-distribution data, and the augmentation vector ϵ ∼ N(0, 0.01·I). For the multi-modal setting, µ = [0, 15, 0, ..., 0]^T. In addition, for the OOD test data x_test = [ν, ζ] ∼ D_test, we set ν = [2, 0, ..., 0] and ζ ∼ N(0, I). [...] For the training set, 10% of labels are flipped to a random class. Images of class 0 (or 1) are colored red (or green) with probability 77.5%, and another random color with probability 22.5%.
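The synthetic setup quoted in the Open Datasets and Experiment Setup rows can be sketched as follows. This is a minimal illustration, not the authors' released code: each sample is taken as a concatenation [signal, noise], mirroring the quoted x_test = [ν, ζ]; how the d = 2000 coordinates are divided between the two parts is not stated in this report, so an even split is assumed purely for illustration, and all variable and function names are our own.

```python
import torch

d, n, n_test = 2000, 100, 200   # data dimension, train / test sample counts (from the paper)
m = 50                          # hidden size of all encoders
d_part = d // 2                 # assumed per-part dimension (signal / noise split not specified)

def make_samples(num, signal):
    """Concatenate a fixed signal vector with fresh noise ~ N(0, I) for each sample."""
    sig = signal.expand(num, -1)
    noise = torch.randn(num, d_part)
    return torch.cat([sig, noise], dim=1)

def augment(x, eps_std=0.1):
    """Add an augmentation vector eps ~ N(0, 0.01 * I), i.e. std 0.1 per coordinate."""
    return x + eps_std * torch.randn_like(x)

# Single-modal signal mu = [5, 0, ..., 0]^T; multi-modal signal mu = [0, 15, 0, ..., 0]^T.
mu_single = torch.zeros(d_part); mu_single[0] = 5.0
mu_multi = torch.zeros(d_part); mu_multi[1] = 15.0

x_train = make_samples(n, mu_single)   # in-distribution training data
x_view2 = augment(x_train)             # positive views produced by augmentation

# OOD test data x_test = [nu, zeta] with nu = [2, 0, ..., 0] and zeta ~ N(0, I).
nu = torch.zeros(d_part); nu[0] = 2.0
x_test = make_samples(n_test, nu)
```

Per the quoted setup, the model is then trained by plain gradient descent with learning rate 0.01 for 200 epochs; since the encoder architecture and contrastive objective are not detailed in this report, the training loop is omitted here.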
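The Colored MNIST construction quoted in the Experiment Setup row can likewise be sketched. The function below is an assumed implementation (name and tensor layout are ours): 10% of training labels are flipped to a random class, and each image is colored consistently with its (possibly flipped) label with probability 77.5%; since only two colors are mentioned, "another random color" is read here as the opposite color.

```python
import torch

def colorize(images, labels, flip_prob=0.10, color_prob=0.775, train=True):
    """images: (N, 28, 28) grayscale in [0, 1]; labels: (N,) with values in {0, 1}."""
    labels = labels.clone()
    if train:
        # 10% of training labels are set to a random class.
        flip = torch.rand(len(labels)) < flip_prob
        labels[flip] = torch.randint(0, 2, (int(flip.sum()),))
    # The color agrees with the (possibly flipped) label with probability 77.5%.
    follow = torch.rand(len(labels)) < color_prob
    colors = torch.where(follow, labels, 1 - labels)      # 0 -> red channel, 1 -> green channel
    colored = torch.zeros(len(images), 3, 28, 28)
    colored[torch.arange(len(images)), colors] = images   # place each digit in its color channel
    return colored, labels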