Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Open-Vocabulary Customization from CLIP via Data-Free Knowledge Distillation
Authors: Yongxian Wei, Zixuan Hu, Li Shen, Zhenyi Wang, Chun Yuan, Dacheng Tao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments showcase the superiority of our approach across twelve customized tasks, achieving a 9.33% improvement compared to existing DFKD methods. |
| Researcher Affiliation | Academia | (1) Tsinghua University, China; (2) Nanyang Technological University, Singapore; (3) Shenzhen Campus of Sun Yat-sen University, China; (4) University of Maryland, College Park, USA |
| Pseudocode | No | The paper includes equations and figures illustrating the framework (e.g., Figure 1), but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper mentions that DFKD 'elegantly resolves these issues with open-sourced pre-trained models' and that 'Our method is an open-vocabulary, customized approach suitable for any category recognized by CLIP.' However, it does not explicitly state that the authors are releasing the code for *their* described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We perform model inversion from texts sourced from datasets including Caltech-101 (Fei-Fei et al., 2004) (101 categories), ImageNet-1K (Deng et al., 2009) (1000 categories), or Flower-102 (Nilsback & Zisserman, 2008) (102 fine-grained categories). |
| Dataset Splits | No | We randomly divide ImageNet-1K into 10 splits to simulate a real customization scenario as closely as possible, reporting average results to demonstrate the robustness of our method. Each task includes over 100 categories encompassing a wide range of natural categories. Further details regarding data statistics are provided in App. H. The student model is evaluated on these datasets, including the test set of ImageNet, and the complete datasets of Caltech-101 and Flower-102, with the classification accuracy (in %) reported. While the paper mentions dividing ImageNet-1K into 10 splits and evaluating on test sets, it does not provide the specific percentages, sample counts, or explicit train/validation/test splits needed for reproduction. |
| Hardware Specification | Yes | SDD is a preliminary step with low complexity, taking only 57 seconds to train on an RTX 4090. |
| Software Dependencies | No | The paper mentions using VQGAN and CLIP, but does not specify version numbers for these or other software libraries/frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | The batch size for prompt learning is set to 64, with a learning rate of 0.01. Surrogate images are synthesized with a resolution of 224 × 224, and optimized using the Adam optimizer with a learning rate of 0.1 for 400 iterations. For text-based customization, 64 images are generated per class. For image-based customization, each class has 4 example images, and 24 additional images are synthesized per class. The inner loop learning rate α and outer loop learning rate for meta knowledge distillation are both set to 0.001, utilizing the SGD optimizer. |
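The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is a hypothetical illustration only: the authors do not release code, so all names below are assumptions, and only the numeric values come from the paper.

```python
# Hypothetical configuration mirroring the hyperparameters reported in the
# paper's experiment setup. Structure and key names are illustrative, not
# taken from the (unreleased) author code; values are quoted from the paper.
CONFIG = {
    "prompt_learning": {"batch_size": 64, "lr": 0.01},
    "surrogate_synthesis": {
        "resolution": (224, 224),   # synthesized image size
        "optimizer": "Adam",
        "lr": 0.1,
        "iterations": 400,
    },
    "text_based_customization": {"synthesized_per_class": 64},
    "image_based_customization": {
        "example_images_per_class": 4,
        "synthesized_per_class": 24,
    },
    "meta_knowledge_distillation": {
        "inner_lr": 0.001,   # inner-loop learning rate (alpha)
        "outer_lr": 0.001,   # outer-loop learning rate
        "optimizer": "SGD",
    },
}

def images_per_class(mode: str) -> int:
    """Total images available per class under each customization mode."""
    if mode == "text":
        return CONFIG["text_based_customization"]["synthesized_per_class"]
    if mode == "image":
        cfg = CONFIG["image_based_customization"]
        return cfg["example_images_per_class"] + cfg["synthesized_per_class"]
    raise ValueError(f"unknown mode: {mode}")

print(images_per_class("text"))   # 64
print(images_per_class("image"))  # 28
```

The helper simply makes explicit that image-based customization uses 4 real examples plus 24 synthesized images (28 per class), versus 64 purely synthesized images in the text-based setting.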