Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Feature Distillation Improves Zero-Shot Transfer from Synthetic Images
Authors: Niclas Popp, Jan Hendrik Metzen, Matthias Hein
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show that small replacements of the CLIP vision encoder can be efficiently and robustly trained using feature distillation on synthetic images. To this end, we introduce a unifying framework for training vision encoders in a zero-shot setting. The main results are stated in Figure 1 and our key findings within this framework are summarized as follows: 1. Feature Distillation is Less Susceptible to Spurious Visual Features Than Vision Language Distillation. [...] 4. Feature Distillation Bridges the Gap to Baselines Trained on Real Images. Based on our first three findings, we distill a ViT-B/32 CLIP vision encoder into students based on the TinyViT (Wu et al., 2022) and EfficientNet (Tan & Le, 2019) architectures with up to 92% fewer parameters using feature distillation on synthetic images. The resulting students closely match the classification performance of the teacher on the Oxford Pets (Parkhi et al., 2012), Flowers-102 (Nilsback & Zisserman, 2008), Stanford Cars (Krause et al., 2013), Food-101 (Bossard et al., 2014), Describable Textures (Cimpoi et al., 2014) and Aircraft (Maji et al., 2013) datasets. Notably, our students are on par with or even surpass the current baselines for distilled CLIP models, including the TinyCLIP model with 8 times more trainable parameters and MobileCLIP, which was trained on over 100 times more images using stronger teachers. |
| Researcher Affiliation | Collaboration | Niclas Popp (Bosch Center for Artificial Intelligence, Robert Bosch GmbH; University of Tübingen), Jan Hendrik Metzen (Bosch Center for Artificial Intelligence, Robert Bosch GmbH), Matthias Hein (University of Tübingen) |
| Pseudocode | No | The paper includes mathematical formulations for loss functions (Section A.15 Multi-Positive Contrastive Loss) and theoretical bounds (Section A.16 Theoretical Bound on Teacher-Student Agreement), but no explicitly structured pseudocode or algorithm blocks are present. |
| Open Source Code | No | The paper does not provide any explicit statements about code release or links to a code repository. |
| Open Datasets | Yes | For this purpose, we select DataComp medium (Gadre et al., 2023) with 123 million images [...] For comparison, we perform domain-agnostic distillation on ImageNet (Deng et al., 2009) with 1.28 million images and SynthCI-30M (Hammoud et al., 2024) with 30 million synthetic images in Section 5.3. For domain-specific distillation, we target the Oxford Pets (Parkhi et al., 2012), Oxford Flowers (Nilsback & Zisserman, 2008), Food-101 (Bossard et al., 2014), Stanford Cars (Krause et al., 2013), Describable Textures (Cimpoi et al., 2014) and Aircraft (Maji et al., 2013) datasets. In the appendix, we include ImageNet-100 (Tian et al., 2020) as a non-domain-specific dataset for reference. |
| Dataset Splits | Yes | These datasets are only used for testing while the actual datasets used for training are synthetically generated based on the class names. [...] The number of images per class roughly matches the size of the real training datasets. We use 265 images per class for the smaller, less diverse datasets and 1011 for the larger ones. More details on the selection of contextual dimension and the dataset sizes are given in Section A.3. [...] Table 8 (sizes of the real target datasets; classes, training images, test images): Pets: 37, 3680, 3669; Flowers: 102, 1020, 6149; Cars: 196, 8144, 8041; Food: 101, 75750, 25250; Texture: 47, 1880, 1880; Aircraft: 100, 3334, 3333; ImageNet-100: 100, 130000, 5000. |
| Hardware Specification | No | The paper does not provide specific details on the GPU models, CPU models, or other hardware used for running the experiments. |
| Software Dependencies | No | For the generation of the images, we utilize an LCM-LoRA (Luo et al., 2023) of Stable Diffusion XL (Podell et al., 2023). [...] As the selection of options for the contextual dimensions and superclasses is relatively simple, we can use a smaller language model, Llama-2 7B fine-tuned for chat (Touvron et al., 2023), and still obtain sufficiently diverse prompts. [...] We train using a batch size of 256 and a constant learning rate of 5e-4 using the AdamW optimizer (Loshchilov & Hutter, 2019). The paper mentions specific models and optimizers but does not provide version numbers for software libraries or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We train using a batch size of 256 and a constant learning rate of 5e-4 using the AdamW optimizer (Loshchilov & Hutter, 2019). All other hyperparameters and augmentations were kept consistent with the CLIP training methodology (Radford et al., 2021). One epoch of training on DataComp medium corresponds to 4.3e5 optimization steps. For domain-specific distillation, we perform 96 optimization epochs for all models. |
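The core technique evaluated above is feature distillation: a small student is trained so that its image features align with those of a frozen CLIP ViT-B/32 teacher. Since the report does not quote the paper's exact alignment objective, the sketch below uses a cosine-distance loss with a learned linear projection (`proj`) to match feature dimensions; the function name, the choice of cosine distance, and all shapes are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def feature_distillation_loss(student_feats, teacher_feats, proj):
    """Cosine-distance alignment between projected student features and
    frozen teacher (CLIP image encoder) features.

    Assumed shapes: student_feats (B, d_s), teacher_feats (B, d_t),
    proj (d_s, d_t) -- a learned linear head matching the dimensions.
    """
    z = student_feats @ proj                                   # (B, d_t)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)           # unit-norm student
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z * t, axis=1)))         # 1 - cosine sim

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))     # toy student features (d_s = 8)
teacher = rng.normal(size=(4, 16))    # toy teacher features (d_t = 16)
proj = rng.normal(size=(8, 16))
loss = feature_distillation_loss(student, teacher, proj)
```

In the reported setup this loss would be minimized over the synthetic images with AdamW (batch size 256, constant learning rate 5e-4) while the teacher stays frozen; perfectly aligned features drive the loss to zero.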
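The Pseudocode row notes that Section A.15 states a multi-positive contrastive loss as a formula only. As an illustration of that family of losses (not the paper's exact formulation, which the report does not quote), a common variant takes the cross-entropy between the softmax over similarities and a target spread uniformly over each anchor's positives; the function name and temperature default below are assumptions.

```python
import numpy as np

def multi_positive_contrastive_loss(sim, pos_mask, tau=0.07):
    """Cross-entropy between the row-wise softmax of sim / tau and a
    target distribution spread uniformly over each anchor's positives.

    sim: (B, B) similarity matrix; pos_mask: (B, B) boolean positives
    (each row must contain at least one True).
    """
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target = pos_mask / pos_mask.sum(axis=1, keepdims=True)
    return float(-(target * log_prob).sum(axis=1).mean())

sim = np.array([[10.0, 0.0], [0.0, 10.0]])   # each anchor matches only itself
pos_mask = np.eye(2, dtype=bool)
loss = multi_positive_contrastive_loss(sim, pos_mask, tau=1.0)  # near zero
```

With a single positive per row this reduces to the standard InfoNCE cross-entropy; multiple `True` entries per row distribute the target mass across all positives of that anchor.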