Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Diversity-oriented Deep Multi-modal Clustering

Authors: Yanzheng Wang, Xin Yang, Yujun Wang, Shizhe Hu, Mingliang Xu

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this segment, we carry out a series of experiments to verify the efficacy of the framework we have proposed. For more experimental data and experimental details, please refer to Appendix A.3. ... Table 2: Clustering results in terms of ACC and NMI on the multi-modal datasets.
Researcher Affiliation Academia Yanzheng Wang#, Xin Yang#, Yujun Wang, Shizhe Hu, Mingliang Xu; School of Computer and Artificial Intelligence, Zhengzhou University, China
Pseudocode Yes Algorithm 1: Diversity-oriented Deep Multi-modal Clustering. Input: Multi-modal datasets {Xm}_{m=1}^{M}; Number of clusters K; Trade-off parameters α and β; Epoch number E; Temperature parameters τ1. Initializing the network; Select the dominant modality by Eq. (3); for i = e to E do: The dominant modality feature Ddm, consistency features {Cm}_{m≠dm}^{M} and diversity features {Zm}_{m≠dm}^{M} are obtained through modality-specific encoders; The clustering results {Am}_{m=1}^{M} of the various modalities are obtained through cluster layers; Calculate L_FDL with Eq. (29), Eq. (31), Eq. (6) and Eq. (7); Calculate L_CDL with Eq. (31), Eq. (9) and Eq. (10); Calculate L_DDC by Eq. (12); Optimize all parameters by minimizing Eq. (13); end for. Output: Multi-modal clustering assignment Q.
Open Source Code Yes Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Data and code are publicly available.
Open Datasets Yes We evaluate the effectiveness of our proposed method by employing five well-known datasets, including Caltech-3V, Caltech-4V, ESP-Game, Flickr and IAPR. The Caltech image dataset [29]... ESP-Game [34] dataset is derived from an online image tagging game... Flickr [22] dataset is a widely used multi-modal dataset for image retrieval... IAPR [35] is a comprehensive multi-modal image dataset...
Dataset Splits No The paper does not explicitly mention specific training, test, or validation splits for the datasets used in the experiments. Unsupervised clustering methods typically use the entire dataset for clustering and then evaluate the results against ground truth labels, implying the full dataset is used for the clustering task itself without explicit splits for model training.
Hardware Specification Yes Our experiments were conducted on a Windows 10 operating system, utilizing a powerful configuration equipped with 96 GB of system memory and a high-performance NVIDIA GeForce RTX 4090D GPU.
Software Dependencies No We implemented the proposed framework using the PyTorch platform [51]. While PyTorch is mentioned, a specific version number for the library is not provided, which is necessary for full reproducibility.
Experiment Setup Yes For all datasets, the training batch size was uniformly set to 512, and we utilized the Adam optimization algorithm with an initial learning rate of 0.0003. The configuration of parameters in the proposed model is detailed as follows. The hyperparameters α and β are tuned to values ranging from 0.0001 to 1000, with each value being a power of 10. Given that 100 epochs proved to be ample for the training convergence of algorithm, we accordingly trained the model from the beginning up to 100 epochs. To enhance robustness and circumvent local minima, we trained the proposed model 10 times, reporting the clustering outcome with the minimal clustering loss. For all datasets, we utilized modal-specific variational encoders comprising three fully connected layers, each layer consists of a batch normalization layer and a RELU layer. The second layer and the output layer are set to 512. The clustering layer adopts a fully connected layer with a softmax layer. The parameter of the dropout layer is set to 0.1. The temperature hyperparameter in the comparative learning is set to 0.5.
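
The settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `TrainConfig`, `hyperparameter_grid`, and `select_best_run` are hypothetical, the `train_once` callable stands in for a training routine that is not shown on this page, and passing a seed to each run is an assumption (the paper only states that the model was trained 10 times and the run with the minimal clustering loss was reported).

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    batch_size: int = 512
    learning_rate: float = 3e-4  # initial learning rate for Adam
    epochs: int = 100
    runs: int = 10               # independent training runs per setting
    hidden_dim: int = 512        # width of the second and output encoder layers
    dropout: float = 0.1
    temperature: float = 0.5


def hyperparameter_grid() -> List[Tuple[float, float]]:
    """All (alpha, beta) pairs, each value a power of 10 from 1e-4 to 1e3."""
    powers = [float(f"1e{k}") for k in range(-4, 4)]
    return list(product(powers, powers))


def select_best_run(
    train_once: Callable[[int], Tuple[float, object]], runs: int
) -> Tuple[float, object]:
    """Call train_once(seed) `runs` times and keep the lowest-loss result.

    train_once is assumed to return (clustering_loss, cluster_assignment).
    """
    results = [train_once(seed) for seed in range(runs)]
    return min(results, key=lambda result: result[0])
```

Under these assumptions, the α/β search covers 8 × 8 = 64 settings, and each setting would be trained `TrainConfig().runs` times before the lowest-loss run is kept.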