Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables

Authors: Yu Gui, Cong Ma, Zongming Ma

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance.
Researcher Affiliation Academia (1) Department of Statistics and Data Science, University of Pennsylvania; (2) Department of Statistics, University of Chicago; (3) Department of Statistics and Data Science, Yale University
Pseudocode No The paper describes mathematical formulations and theoretical concepts but does not include a distinct pseudocode or algorithm block.
Open Source Code Yes Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Datasets are available online and reproducible codes are submitted as supplementary materials.
Open Datasets Yes Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance. [...] 4 Numerical experiments: In this section, we further justify the theoretical findings with both synthetic and real-world datasets. Starting with a synthetic dataset in Section 4.1, we further consider real datasets: a CITE-seq dataset [68, 69] in Section 4.2, ImageNetV2 dataset [61] in Appendix 4.3, and YFCC dataset [72] in Appendix G.7.
Dataset Splits Yes We use a training set of size 12000 and a separate test set of size 2000. The total N = 14000 data points are partitioned into a training set Dtrain with |Dtrain| = 10000, a test set Dtest with |Dtest| = 2000, and a separate set with size 2000 for estimating the expected norm at each epoch. [...] we randomly sample 20000 rows without replacement from the preprocessed dataset and randomly split the subset into a training set Dtrain with |Dtrain| = 10000, a test set Dtest with |Dtest| = 2000, and a separate set with size 8000 for estimating the expected norm at each epoch. [...] with |Dtrain| = 8000, |Dtest| = 1000, and a separate dataset with size 1000 to estimate the expected norms.
Hardware Specification No The paper does not explicitly describe any specific hardware used for running its experiments.
Software Dependencies No We estimate the global intrinsic dimension of data using the MLE-based approach proposed in [35], which is implemented in the skdim.id package. We follow the preprocessing in https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis
Experiment Setup Yes The neural network is trained for 800 epochs with learning rate $10^{-4}$ and weight decay $10^{-4}$, and a slightly faster rate is used for temperature $\tau$: $10^{-3}$ in synthetic experiments and $2 \times 10^{-4}$ for real data. This function class is denoted by $\mathcal{F}^{p,d}_{\mathrm{NN}}$, where $p$ is the input dimension adjusted for each setting.
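
The three-way partition quoted in the Dataset Splits row (sample 20000 rows without replacement, then split into 10000 training rows, 2000 test rows, and 8000 rows for estimating the expected norm) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `three_way_split`, the seed, and the total row count of 50000 are hypothetical, since the size of the preprocessed dataset is not stated here.

```python
import random


def three_way_split(n_rows, n_sample, n_train, n_test, seed=0):
    """Sample n_sample row indices without replacement and partition them.

    Returns (train, test, norm) index lists. The third list holds whatever
    remains after the training and test rows are taken; in the quoted setup
    it is used to estimate the expected norm at each epoch.
    """
    if n_sample > n_rows:
        raise ValueError("cannot sample more rows than the dataset contains")
    if n_train + n_test > n_sample:
        raise ValueError("train and test sizes exceed the sampled subset")

    rng = random.Random(seed)
    # random.sample draws without replacement and returns the indices in
    # random order, so contiguous slices form a random partition.
    idx = rng.sample(range(n_rows), n_sample)

    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    norm = idx[n_train + n_test:]
    return train, test, norm


# Sizes from the quoted setup; n_rows=50_000 is a hypothetical placeholder.
train_idx, test_idx, norm_idx = three_way_split(
    n_rows=50_000, n_sample=20_000, n_train=10_000, n_test=2_000
)
print(len(train_idx), len(test_idx), len(norm_idx))  # prints: 10000 2000 8000
```

Fixing the seed makes the partition repeatable across runs, which matters when the split itself is part of what a reproduction has to match.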