Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features

Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Qingming Huang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. Considering the significant increase in activations, a full-scale quantitative comparison is no longer practical. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors.
Researcher Affiliation | Academia | 1 Institute of Information Engineering, CAS; 2 School of Cyber Security, University of Chinese Academy of Sciences; 3 Key Lab. of Intelligent Information Processing, Institute of Computing Technology, CAS; 4 Peng Cheng Laboratory; 5 School of Cyber Science and Tech., Shenzhen Campus of Sun Yat-sen University; 6 School of Computer Science and Tech., University of Chinese Academy of Sciences; 7 Key Laboratory of Big Data Mining and Knowledge Management, CAS
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at this url.
Open Datasets | Yes | We conduct experiments on the SPair-71k dataset [30]. For this task, we choose the ADE20K dataset [59] with over 20k annotated images of 150 semantic categories, and the Cityscapes dataset [8]... For this setting, we use a dataset collected in [2] and experiment on its Horse-21 subset, the data of which is sourced from LSUN [54].
Dataset Splits | Yes | PCK@0.1img and PCK@0.1bbox are used, following the widely-adopted protocol reported in [30] (for the SPair-71k dataset, implying predefined splits). For the ADE20K, Cityscapes, and Horse-21 datasets, explicit train/validation/test splits are not detailed; only the size of the Horse-21 training set (30 labeled images) is mentioned.
Hardware Specification | Yes | We use Nvidia(R) RTX 3090 and Nvidia(R) RTX 4090 GPUs for the experiments, all with 24GB VRAM.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used.
Experiment Setup | Yes | The model is trained for two epochs, each containing 5,000 sample pairs, following conventional settings. Our implementation is derived from DHF, and we keep all hyper-parameters at their default settings. Our implementation directly uses the hyper-parameters reported in MetaPrompts. Our implementation is derived from DDPM with only the batch size changed among all hyper-parameters. We use a larger batch size for faster experiments, as a smaller one does not improve performance.