Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Beyond Modality Collapse: Representation Blending for Multimodal Dataset Distillation
Authors: Xin Zhang, Ziruo Zhang, Jiawei Du, Zuozhu Liu, Joey Tianyi Zhou
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to 6.7× distillation speedup. |
| Researcher Affiliation | Academia | (1) Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; (2) Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; (3) National University of Singapore, Singapore; (4) Zhejiang University, China |
| Pseudocode | Yes | Algorithm 1 Blending Representations to Mitigate Modality Collapse in MDD |
| Open Source Code | Yes | Our code is publicly available at https://github.com/zhangxin-xd/RepBlend. |
| Open Datasets | Yes | Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods [...] We evaluate our method on two widely-used image captioning datasets: Flickr-30K [36] and MS-COCO [27] [...] to the AudioCaps [23] audio-text benchmark [...] image encoder is initialized with ImageNet-1K pretrained weights [8] [...] LLaVA-cc3m dataset [...] https://huggingface.co/xinxin66/RepBlend/tree/main/datasets/cc3m [...] zero-shot ImageNet [8] classification and OCR-relevant retrieval on TextCaps [45]. |
| Dataset Splits | Yes | We use approximately 60% of the data (about 334k pairs) for training and reserve a non-overlapping set of 10k pairs for validation. |
| Hardware Specification | Yes | All experiments are conducted using two NVIDIA RTX 3090 GPUs and one NVIDIA H100 GPU. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers in the main text. |
| Experiment Setup | Yes | Implementation Details. We construct a CLIP-style architecture using the aforementioned image and text encoders. The image encoder is initialized with ImageNet-pretrained weights [8], while the text encoder is initialized with the official pretrained weights provided by the corresponding language model. After feature extraction, the outputs from both branches are passed through separate linear projection layers to obtain the final embeddings. During buffer generation, distillation, and evaluation training, the encoders are frozen and only the projection layers are optimized. We collect 20 expert trajectories, each consisting of 10 training epochs. The hyperparameter settings follow those used in LoRS [55] and can be found in Table 7 and Table 8 in Appendix F. |
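
The retrieval figures quoted in the table (IR@10, TR@10) are not defined on this page. The sketch below assumes they are the usual Recall@K scores for image retrieval (text query, image candidates) and text retrieval (image query, text candidates). It is an illustration only, not the authors' evaluation code: the function names and the toy similarity matrix are hypothetical, and it assumes exactly one matching caption per image, which is simpler than Flickr-30K or MS-COCO, where each image has several captions.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose matching item (same index) ranks in the top k.

    `similarity[i][j]` is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Rank candidate indices by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix):
    """Swap the roles of queries and candidates."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical toy scores: rows are images, columns are texts.
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

tr_at_1 = recall_at_k(toy_sim, k=1)             # image query -> text candidates
ir_at_1 = recall_at_k(transpose(toy_sim), k=1)  # text query -> image candidates
print(f"TR@1 = {tr_at_1:.1f}, IR@1 = {ir_at_1:.1f}")
```

In a real evaluation the similarity matrix would come from the projected image and text embeddings, for example as cosine similarities, and K would be set to 1, 5, or 10.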