Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Authors: Yu Gui, Cong Ma, Zongming Ma
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance. |
| Researcher Affiliation | Academia | (1) Department of Statistics and Data Science, University of Pennsylvania; (2) Department of Statistics, University of Chicago; (3) Department of Statistics and Data Science, Yale University |
| Pseudocode | No | The paper describes mathematical formulations and theoretical concepts but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Datasets are available online and reproducible codes are submitted as supplementary materials. |
| Open Datasets | Yes | Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance. Section 4, Numerical experiments: In this section, we further justify the theoretical findings with both synthetic and real-world datasets. Starting with a synthetic dataset in Section 4.1, we further consider real datasets: a CITE-seq dataset [68, 69] in Section 4.2, ImageNetV2 dataset [61] in Appendix 4.3, and YFCC dataset [72] in Appendix G.7. |
| Dataset Splits | Yes | We use a training set of size 12000 and a separate test set of size 2000. The total $N = 14000$ data points are partitioned into a training set $D_{\mathrm{train}}$ with $\lvert D_{\mathrm{train}} \rvert = 10000$, a test set $D_{\mathrm{test}}$ with $\lvert D_{\mathrm{test}} \rvert = 2000$, and a separate set with size 2000 for estimating the expected norm at each epoch. We randomly sample 20000 rows without replacement from the preprocessed dataset and randomly split the subset into a training set $D_{\mathrm{train}}$ with $\lvert D_{\mathrm{train}} \rvert = 10000$, a test set $D_{\mathrm{test}}$ with $\lvert D_{\mathrm{test}} \rvert = 2000$, and a separate set with size 8000 for estimating the expected norm at each epoch. with $\lvert D_{\mathrm{train}} \rvert = 8000$, $\lvert D_{\mathrm{test}} \rvert = 1000$, and a separate dataset with size 1000 to estimate the expected norms |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware used for running its experiments. |
| Software Dependencies | No | We estimate the global intrinsic dimension of data using the MLE-based approach proposed in [35], which is implemented in the skdim.id package. We follow the preprocessing in https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis |
| Experiment Setup | Yes | The neural network is trained for 800 epochs with learning rate $10^{-4}$ and weight decay $10^{-4}$, and a slightly faster rate is used for temperature $\tau$: $10^{-3}$ in synthetic experiments and $2 \times 10^{-4}$ for real data. This function class is denoted by $F^{p,d}_{\mathrm{NN}}$, where $p$ is the input dimension adjusted for each setting. |
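
The Dataset Splits row quotes a three-way partition: sample rows without replacement, then divide them into a training set, a test set, and a held-out set for estimating the expected norm. The sketch below illustrates one way such a split could be produced with the Python standard library. It is not the authors' code: the function name `three_way_split`, the seed, and the total row count of 30000 are hypothetical, and only the sizes 20000, 10000, 2000, and 8000 come from the quoted text.

```python
import random


def three_way_split(n_rows, n_sample, n_train, n_test, seed=0):
    """Sample n_sample row indices without replacement and split them.

    Returns (train, test, norm) index lists. The remainder after the
    training and test sets forms the held-out set used for norm estimation.
    All names and the seed are illustrative assumptions.
    """
    if n_sample > n_rows:
        raise ValueError("cannot sample more rows than are available")
    if n_train + n_test > n_sample:
        raise ValueError("train and test sizes exceed the sample size")

    rng = random.Random(seed)
    # random.sample draws without replacement and returns the indices in
    # random order, so contiguous slices are themselves random subsets.
    idx = rng.sample(range(n_rows), n_sample)

    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    norm = idx[n_train + n_test:]
    return train, test, norm


# Sizes from the quoted excerpt; n_rows=30000 is a hypothetical placeholder
# for the number of rows in the preprocessed dataset.
train_idx, test_idx, norm_idx = three_way_split(30000, 20000, 10000, 2000)
print(len(train_idx), len(test_idx), len(norm_idx))
```

Fixing the seed makes the partition repeatable across runs, which is the property a reader would need in order to reproduce the reported splits; whether the paper's supplementary code fixes a seed is not stated in the excerpts above.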