Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Vision-Language Dataset Distillation

Authors: Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky

TMLR 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section, we first describe the cross-modal retrieval test-bed in Sec. 4.1. We use it to evaluate our vision-language dataset co-distillation performance. We then compare our method to baseline coreset selection approaches and provide the key quantitative results, qualitative results, and cross-architecture generalization results in Sec. 4.2. We further conduct a set of ablation studies in Sec. 4.3. |
| Researcher Affiliation | Collaboration | Xindi Wu (1), Byron Zhang (1), Zhiwei Deng (2), Olga Russakovsky (1); (1) Princeton University, (2) Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Bi-Trajectory Co-Distillation |
| Open Source Code | No | Website: https://princetonvisualai.github.io/multimodal_dataset_distillation/ |
| Open Datasets | Yes | We evaluate our method on standard vision-language datasets: Flickr30K (Plummer et al., 2015) and COCO (Lin et al., 2014), which are widely used for image-text retrieval tasks. We use them for expert training (stage 1) and distillation (stage 2). We adopt the Karpathy split (Karpathy & Fei-Fei, 2015) for Flickr30K (29k/1k/1k) and COCO (113k/5k/5k) for train/validation/test respectively. |
| Dataset Splits | Yes | We adopt the Karpathy split (Karpathy & Fei-Fei, 2015) for Flickr30K (29k/1k/1k) and COCO (113k/5k/5k) for train/validation/test respectively. |
| Hardware Specification | Yes | For expert training, we train on a single RTX 3090 GPU for 10 epochs, where a single epoch takes 40 minutes of wall-clock time. ... it takes 6 to 15 GPU hours depending on the settings (e.g. number of distilled pairs) with an 8-GPU A6000 node. |
| Software Dependencies | No | The paper mentions several models (BERT, NFNet, ViT) and an optimizer (SGD), but does not provide specific version numbers for any software libraries or frameworks such as PyTorch, TensorFlow, or Python itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For expert training, we train on a single RTX 3090 GPU for 10 epochs... We initialize a trainable learning rate α at 0.1 for the student model. We followed the data augmentation techniques in (Li et al., 2022), including resizing, cropping, flipping, and RandAugment. We use SGD with momentum = 0.5; the learning rates for updating α, the distilled image pixels, and the distilled text embeddings are 1e-02, 1000, and 1000, respectively. |
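The Experiment Setup row quotes three separate learning rates (1e-02 for the trainable student learning rate α, 1000 for the distilled image pixels, and 1000 for the distilled text embeddings) under SGD with momentum 0.5. A minimal sketch of that per-parameter configuration, assuming scalar parameters and a hand-rolled momentum update (the function and variable names here are illustrative, not from the paper's code):

```python
# Hedged sketch: only the hyperparameter values (momentum 0.5, alpha
# initialized at 0.1, per-group learning rates 1e-2 / 1000 / 1000) come
# from the paper excerpt; everything else is an assumption.

def sgd_momentum_step(param, grad, velocity, lr, momentum=0.5):
    """One SGD-with-momentum update: v <- momentum * v + grad; p <- p - lr * v."""
    new_velocity = momentum * velocity + grad
    return param - lr * new_velocity, new_velocity

# Per-group learning rates from the paper's experiment setup.
LEARNING_RATES = {
    "alpha": 1e-2,             # trainable student learning rate
    "distilled_images": 1000.0,
    "distilled_text": 1000.0,
}

alpha = 0.1       # trainable learning rate, initialized at 0.1 per the paper
velocity = 0.0
# Single scalar update for alpha with a hypothetical gradient of 0.5.
alpha, velocity = sgd_momentum_step(
    alpha, grad=0.5, velocity=velocity, lr=LEARNING_RATES["alpha"]
)
# alpha is now 0.1 - 0.01 * 0.5 = 0.095
```

In a full implementation these three groups would typically be passed to the optimizer as separate parameter groups, each carrying its own learning rate while sharing the momentum setting.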