Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Vision-Language Dataset Distillation

Authors: Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky

TMLR 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section, we first describe the cross-modal retrieval test-bed in Sec. 4.1. We use it to evaluate our vision-language dataset co-distillation performance. We then compare our method to baseline coreset selection approaches and provide the key quantitative results, qualitative results, and cross-architecture generalization results in Sec. 4.2. We further conduct a set of ablation studies in Sec. 4.3. |
| Researcher Affiliation | Collaboration | Xindi Wu (1), Byron Zhang (1), Zhiwei Deng (2), Olga Russakovsky (1); (1) Princeton University, (2) Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Bi-Trajectory Co-Distillation |
| Open Source Code | No | Website: https://princetonvisualai.github.io/multimodal_dataset_distillation/ |
| Open Datasets | Yes | We evaluate our method on standard vision-language datasets: Flickr30K (Plummer et al., 2015) and COCO (Lin et al., 2014), which are widely used for image-text retrieval tasks. We use them for expert training (stage 1) and distillation (stage 2). We adopt the Karpathy split (Karpathy & Fei-Fei, 2015) for Flickr30K (29k/1k/1k) and COCO (113k/5k/5k) for train/validation/test respectively. |
| Dataset Splits | Yes | We adopt the Karpathy split (Karpathy & Fei-Fei, 2015) for Flickr30K (29k/1k/1k) and COCO (113k/5k/5k) for train/validation/test respectively. |
| Hardware Specification | Yes | For expert training, we train on a single RTX 3090 GPU for 10 epochs, where a single epoch takes 40 minutes of wall-clock time. ... it takes 6 to 15 GPU hours depending on the settings (e.g. number of distilled pairs) with an 8-GPU A6000 node. |
| Software Dependencies | No | The paper mentions several models (BERT, NFNet, ViT) and an optimizer (SGD), but does not provide specific version numbers for any software libraries or frameworks such as PyTorch, TensorFlow, or Python itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For expert training, we train on a single RTX 3090 GPU for 10 epochs... We initialize a trainable learning rate α at 0.1 for the student model. We followed the data augmentation techniques in (Li et al., 2022), including resizing, cropping, flipping, and RandAugment. We use SGD with momentum = 0.5; the learning rates for updating α, the distilled image pixels, and the distilled text embeddings are 1e-02, 1000, and 1000, respectively. |
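The Experiment Setup row quotes three separate learning rates (1e-02 for the trainable student learning rate α, 1000 for the distilled image pixels, and 1000 for the distilled text embeddings) under SGD with momentum 0.5. A minimal sketch of that per-parameter configuration, assuming scalar parameters and a hand-rolled momentum update (the function and variable names here are illustrative, not from the paper's code):

```python
# Hedged sketch: only the hyperparameter values (momentum 0.5, alpha
# initialized at 0.1, per-group learning rates 1e-2 / 1000 / 1000) come
# from the paper excerpt; everything else is an assumption.

def sgd_momentum_step(param, grad, velocity, lr, momentum=0.5):
    """One SGD-with-momentum update: v <- momentum * v + grad; p <- p - lr * v."""
    new_velocity = momentum * velocity + grad
    return param - lr * new_velocity, new_velocity

# Per-group learning rates from the paper's experiment setup.
LEARNING_RATES = {
    "alpha": 1e-2,             # trainable student learning rate
    "distilled_images": 1000.0,
    "distilled_text": 1000.0,
}

alpha = 0.1       # trainable learning rate, initialized at 0.1 per the paper
velocity = 0.0
# Single scalar update for alpha with a hypothetical gradient of 0.5.
alpha, velocity = sgd_momentum_step(
    alpha, grad=0.5, velocity=velocity, lr=LEARNING_RATES["alpha"]
)
# alpha is now 0.1 - 0.01 * 0.5 = 0.095
```

In a full implementation these three groups would typically be passed to the optimizer as separate parameter groups, each carrying its own learning rate while sharing the momentum setting.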