Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Closer Look at Multimodal Representation Collapse

Authors: Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
Researcher Affiliation | Collaboration | ¹Fujitsu Research of Europe, ²University of Surrey. Correspondence to: Abhra Chaudhuri <EMAIL>.
Pseudocode | No | The paper describes an algorithm called Explicit Basis Reallocation (EBR) in Section 3.4, detailing the optimization criteria and updates using mathematical notation. However, it does not present this as a formally structured pseudocode block or algorithm figure.
Open Source Code | No | The abstract mentions a 'Project page: https://abhrac.github.io/mmcollapse/'. This is a high-level project overview page and not a direct link to a source-code repository, nor does the text explicitly state that the code for the described methodology is released or available.
Open Datasets | Yes | We choose the MIMIC-IV (Johnson et al., 2023) and avMNIST (Vielzeuf et al., 2018) datasets for our experiments.
Dataset Splits | Yes | For MIMIC-IV, we follow the same settings as (Wu et al., 2024) and that of (Wang et al., 2023; Ma et al., 2021) for avMNIST. [...] To evaluate our approach, we adopt the experimental setup of MUSE (Wu et al., 2024), following which we mask out the modalities in the MIMIC-IV dataset with probabilities {0.1, 0.2, 0.3, 0.4, 0.7}.
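The masking protocol quoted above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the dict-of-features representation, and zeroing as the masking operation are all assumptions made for illustration.

```python
import random

def mask_modalities(modalities, p):
    """Illustrative sketch: drop each modality independently with probability p.

    modalities: dict mapping modality name -> feature vector (list of floats).
    Returns a copy in which each masked modality is replaced by a zero vector.
    """
    masked = {}
    for name, feats in modalities.items():
        if random.random() < p:
            masked[name] = [0.0] * len(feats)  # modality masked out
        else:
            masked[name] = list(feats)  # modality kept as-is
    return masked

# Evaluation would sweep p over the rates quoted in the setup,
# e.g. {0.1, 0.2, 0.3, 0.4, 0.7}:
sample = {"vitals": [0.2, 0.5], "notes": [1.0, 0.0, 0.7]}
out = mask_modalities(sample, 0.3)
```

Masking each modality independently per sample is one common reading of such a setup; the exact sampling scheme used by MUSE may differ.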
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as specific GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper mentions using 'Tian et al. (2020) as our cross-modal knowledge distillation (KD) algorithm of choice applied on top of MUSE (Wu et al., 2024)', but it does not specify version numbers for any software libraries, frameworks, or tools.
Experiment Setup | Yes | The two hidden layers of ψ have output dimensionalities 512 and 256 respectively. The hidden layers of h have output dimensionalities 1024 and 512 respectively, whereas that of h1 is 512 and 1024. The model was trained for 1200 epochs, with an initial learning rate of 0.01, decayed at a rate of 0.9 every 100 epochs. We interleave between the optimization of Lmd and Lsem every 10 epochs.
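The quoted schedule (initial learning rate 0.01 decayed by 0.9 every 100 epochs, with the two losses optimized in alternating 10-epoch blocks) can be sketched as below. The function names are made up for illustration, and which of Lmd or Lsem is optimized first is an assumption; the quote does not say.

```python
def lr_at_epoch(epoch, base_lr=0.01, decay=0.9, step=100):
    """Step decay matching the quoted schedule: start at 0.01,
    multiply by 0.9 once every 100 epochs."""
    return base_lr * (decay ** (epoch // step))

def active_loss(epoch, interval=10):
    """Alternate between the two objectives in 10-epoch blocks.
    Starting with Lmd is an assumption, not stated in the quote."""
    return "Lmd" if (epoch // interval) % 2 == 0 else "Lsem"

# A 1200-epoch run would then look like:
for epoch in range(1200):
    lr = lr_at_epoch(epoch)      # e.g. 0.01 for epochs 0-99, 0.009 for 100-199
    loss = active_loss(epoch)    # e.g. "Lmd" for epochs 0-9, "Lsem" for 10-19
```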