Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Beyond Modality Collapse: Representation Blending for Multimodal Dataset Distillation

Authors: Xin Zhang, Ziruo Zhang, Jiawei Du, Zuozhu Liu, Joey Tianyi Zhou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to 6.7× distillation speedup.
Researcher Affiliation	Academia	(1) Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; (2) Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; (3) National University of Singapore, Singapore; (4) Zhejiang University, China
Pseudocode	Yes	Algorithm 1 Blending Representations to Mitigate Modality Collapse in MDD
Open Source Code	Yes	Our code is publicly available at https://github.com/zhangxin-xd/RepBlend.
Open Datasets	Yes	Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods [...] We evaluate our method on two widely-used image captioning datasets: Flickr-30K [36] and MS-COCO [27] [...] to the AudioCaps [23] audio-text benchmark [...] image encoder is initialized with ImageNet-1K pretrained weights [8] [...] LLaVA-cc3m dataset [...] https://huggingface.co/xinxin66/RepBlend/tree/main/datasets/cc3m [...] zero-shot ImageNet [8] classification and OCR-relevant retrieval on TextCaps [45].
Dataset Splits	Yes	We use approximately 60% of the data (about 334k pairs) for training and reserve a non-overlapping set of 10k pairs for validation.
Hardware Specification	Yes	All experiments are conducted using two NVIDIA RTX 3090 GPUs and one NVIDIA H100 GPU.
Software Dependencies	No	The paper does not explicitly list software dependencies with version numbers in the main text.
Experiment Setup	Yes	Implementation Details. We construct a CLIP-style architecture using the aforementioned image and text encoders. The image encoder is initialized with ImageNet-pretrained weights [8], while the text encoder is initialized with the official pretrained weights provided by the corresponding language model. After feature extraction, the outputs from both branches are passed through separate linear projection layers to obtain the final embeddings. During buffer generation, distillation, and evaluation training, the encoders are frozen and only the projection layers are optimized. We collect 20 expert trajectories, each consisting of 10 training epochs. The hyperparameter settings follow those used in LoRS [55] and can be found in Table 7 and Table 8 in Appendix F.
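
The retrieval gains quoted in the Research Type row are reported as IR@K and TR@K. As a rough illustration only, the sketch below computes a recall-at-K score from a toy image-by-text similarity matrix. It is not the paper's evaluation code. It assumes that IR@K means text-to-image retrieval recall and TR@K means image-to-text retrieval recall, and that each image has exactly one matching caption (benchmarks such as Flickr-30K pair several captions with each image). The function names `recall_at_k` and `transpose` and the matrix values are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: query i matches candidate i only (one-to-one pairs).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap the query and candidate axes of a similarity matrix."""
    return [list(column) for column in zip(*matrix)]


# Toy similarity matrix: rows are images, columns are captions (made-up values).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

tr_at_1 = recall_at_k(sim, k=1)             # image query, caption candidates
ir_at_1 = recall_at_k(transpose(sim), k=1)  # caption query, image candidates
tr_at_2 = recall_at_k(sim, k=2)
```

In this toy matrix only the first pair is ranked first in either direction, so both K=1 scores are one hit out of three queries, while K=2 recovers all three pairs.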