Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder

Authors: Yongmin Lee, Hye Won Chung

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs. Our code is available at https://github.com/Yongalls/CovMatch. We evaluate CovMatch on image-text retrieval tasks using the Flickr30K [35] and COCO [26] benchmarks. CovMatch consistently outperforms state-of-the-art multimodal distillation methods, including MTT-VL [43] and LoRS [45]. Figures 2, 4, 5, 6, 7, and 8 visualize experimental results. Tables 2, 3, and 6 through 16 all report experimental results. Section 4 is titled "Experimental Results".
Researcher Affiliation | Academia | Yongmin Lee, School of Electrical Engineering, KAIST, EMAIL; Hye Won Chung, School of Electrical Engineering, KAIST, EMAIL
Pseudocode | Yes | Algorithm 1: Dataset Distillation via Cross-Covariance Matching (CovMatch)
 1: Require: full training set 𝒯, pretrained weights θ_v^pretrained for image encoder f_v, pretrained weights θ_l^pretrained for text encoder f_l, learning rate α for the distilled data
 2: repeat
 3:   Initialize θ_v, θ_l ← θ_v^pretrained, θ_l^pretrained
 4:   Randomly initialize projection layers G_v, G_l
 5:   for t = 0 to T − 1 do
 6:     Sample mini-batch pairs B_𝒯 ∼ 𝒯 and B_S ∼ S
 7:     Compute the matching loss L_CovMatch (13) on (B_𝒯, B_S)
 8:     Update S ← S − α ∇_S L_CovMatch
 9:     Train the model (θ_v, θ_l, G_v, G_l) with 𝒯 for one step
10:   end for
11: until convergence
12: Output: S
Open Source Code | Yes | Our code is available at https://github.com/Yongalls/CovMatch.
Open Datasets | Yes | We evaluate CovMatch on image-text retrieval tasks using the Flickr30K [35] and COCO [26] benchmarks. Both datasets consist of image-caption pairs: Flickr30K contains approximately 31K images and COCO contains 123K images, each annotated with five captions. We also perform experiments on the WebVid-10M dataset [2] in Appendix D.
Dataset Splits | Yes | We adopt the Karpathy split [18] for both datasets, yielding train/validation/test splits of 29K/1K/1K for Flickr30K and 113K/5K/5K for COCO, respectively.
Hardware Specification | Yes | Resource requirements for long-term trajectory matching methods on a single A100 80GB GPU. For instance, storing expert trajectories for large backbones such as NFNet (image encoder) and BERT (text encoder) can require over 120GB of storage and 5 days of training on a single A100 GPU.
Software Dependencies | No | The paper mentions using an 'ImageNet-pretrained Normalizer-Free ResNet (NFNet) [4] as the image encoder and a pretrained BERT-base model [9] as the text encoder'. It also mentions 'SGD optimizer'. However, no specific software versions (e.g., Python, PyTorch, TensorFlow, CUDA versions) are provided.
Experiment Setup | Yes | Distillation: The synthetic dataset is initialized with randomly selected real image-text pairs from the training set. During distillation, both the synthetic image pixels and text input embeddings are optimized using SGD with momentum 0.5 and a learning rate of 1.0. At each distillation step, the cross-covariance matrix and feature means are computed using a batch of 128 real samples. For the synthetic data, the entire set is used for these computations, except in the 500-pair setting, where a batch of 256 synthetic samples is used to reduce memory consumption. We set the scaling factor to ρ = 2 for 100 synthetic pairs and ρ = 1 for 200 or more pairs. The feature matching weight λ is fixed at 0.1 for 100 and 200 pairs, and increased to 0.5 or 0.6 for 500 pairs to impose stronger regularization on cross-covariance alignment. Note that all network components, including the image encoder, text encoder, and projection layers, are updated with one step of training on the real dataset at each distillation step, and re-initialized every 50 updates. We distill for 10,000 iterations by default; for the 500-pair setting, we extend this to 20,000 iterations to ensure full convergence, even after reaching 95% of the final performance. A summary of the hyperparameters used in CovMatch is provided in Table 4. Evaluation: During the evaluation stage, we train the model using the SGD optimizer with momentum 0.9, weight decay 5e-4, batch size 128, and a learning rate of 0.01 for the image and text encoders and 0.1 for the projection layers. Training is conducted for 100 epochs, and we employ a multi-step learning rate scheduler that decays the learning rate by a factor of 0.1 at the 50th epoch.
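
Steps 6 to 8 of Algorithm 1 in the Pseudocode row depend on a matching loss built from cross-covariance matrices and feature means of real and synthetic batches. The following is a minimal pure-Python sketch of such a loss, not the authors' implementation. This report does not reproduce Eq. (13), so the squared-Frobenius form, the place where the scaling factor ρ enters, and every function name below are assumptions made for illustration.

```python
# Illustrative sketch only. The exact form of the paper's Eq. (13) is not given
# in this report, so the loss below is an ASSUMED form; all names are hypothetical.


def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cross_covariance(img_feats, txt_feats):
    """Cross-covariance between paired image and text features.

    Returns (C, mu_v, mu_l), where C[i][j] is the average over pairs of
    (v[i] - mu_v[i]) * (l[j] - mu_l[j]).
    """
    n = len(img_feats)
    mu_v = mean_vector(img_feats)
    mu_l = mean_vector(txt_feats)
    cov = [[0.0] * len(mu_l) for _ in mu_v]
    for v, l in zip(img_feats, txt_feats):
        for i, vi in enumerate(v):
            for j, lj in enumerate(l):
                cov[i][j] += (vi - mu_v[i]) * (lj - mu_l[j]) / n
    return cov, mu_v, mu_l


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching_loss(real_img, real_txt, syn_img, syn_txt, rho=1.0, lam=0.1):
    """ASSUMED loss: ||C_real - rho * C_syn||_F^2 + lam * (mean-feature gaps)."""
    c_real, mv_real, ml_real = cross_covariance(real_img, real_txt)
    c_syn, mv_syn, ml_syn = cross_covariance(syn_img, syn_txt)
    cov_term = sum(
        (cr - rho * cs) ** 2
        for row_real, row_syn in zip(c_real, c_syn)
        for cr, cs in zip(row_real, row_syn)
    )
    mean_term = squared_distance(mv_real, mv_syn) + squared_distance(ml_real, ml_syn)
    return cov_term + lam * mean_term


# Toy features: three pairs with 2-d image features and 1-d text features.
real_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
real_txt = [[0.5], [1.5], [1.0]]
syn_img = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
syn_txt = [[0.5], [1.5], [2.0]]
print(matching_loss(real_img, real_txt, syn_img, syn_txt))
```

In an actual distillation loop the features would come from the encoders and projection layers, and step 8 would differentiate this loss with respect to the synthetic pairs using an autodiff framework; that part is omitted here.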
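
The settings quoted in the Experiment Setup row can be collected into a few small helpers. This is a sketch under assumptions: epochs are counted from 0 and the decay takes effect from epoch index 50 onward, which the source does not specify, and the names `EVAL_CONFIG`, `eval_learning_rate`, `scaling_factor`, and `distillation_iterations` are hypothetical.

```python
# Evaluation hyperparameters as quoted in the Experiment Setup row.
# The dictionary layout and all names are hypothetical, for illustration only.
EVAL_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "batch_size": 128,
    "epochs": 100,
    "base_lr": {"encoders": 0.01, "projection": 0.1},
    "milestones": [50],  # decay at the 50th epoch
    "gamma": 0.1,        # multiplicative decay factor
}


def eval_learning_rate(epoch, group, config=EVAL_CONFIG):
    """Multi-step schedule: scale the base rate by gamma per milestone passed.

    ASSUMPTION: epochs are 0-indexed and a milestone m applies once epoch >= m.
    """
    passed = sum(1 for m in config["milestones"] if epoch >= m)
    return config["base_lr"][group] * config["gamma"] ** passed


def scaling_factor(num_pairs):
    """rho = 2 for 100 synthetic pairs, rho = 1 for 200 or more pairs."""
    if num_pairs == 100:
        return 2.0
    if num_pairs >= 200:
        return 1.0
    raise ValueError("the source gives no scaling factor for this setting")


def distillation_iterations(num_pairs):
    """10,000 iterations by default, 20,000 for the 500-pair setting."""
    return 20_000 if num_pairs == 500 else 10_000


for epoch in (0, 49, 50, 99):
    print(epoch, eval_learning_rate(epoch, "encoders"),
          eval_learning_rate(epoch, "projection"))
```

The feature matching weight λ for 500 pairs is left out of the helpers because the source states only "0.5 or 0.6" without saying which value applies to which dataset.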