EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

Authors: Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate our produced segmentation masks both qualitatively and quantitatively. To quantitatively evaluate our segmentation masks, we apply our framework to two downstream tasks: unsupervised semantic segmentation and annotation-free open-vocabulary segmentation.
Researcher Affiliation | Collaboration | ¹University of Toronto, ²Vector Institute, ³NVIDIA
Pseudocode | No | The paper describes its methods in paragraph form and with mathematical equations but does not include any pseudocode or algorithm blocks.
Open Source Code | No | Project page: https://kmcode1.github.io/Projects/EmerDiff/. This link is a project page, not a direct link to a source-code repository.
Open Datasets | Yes | The effectiveness of our framework is extensively evaluated on multiple scene-centric datasets such as COCO-Stuff (Caesar et al., 2018), PASCAL-Context (Mottaghi et al., 2014), ADE20K (Zhou et al., 2019) and Cityscapes (Cordts et al., 2016).
Dataset Splits | No | The paper evaluates its framework on existing datasets (COCO-Stuff, PASCAL-Context, ADE20K, Cityscapes) using their ground-truth annotations, but does not specify a training/validation/test split for its own method or data.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper mentions using the 'official Stable Diffusion v1.4 checkpoint', which is a model version, but does not list other software dependencies (e.g., programming language, libraries, or frameworks with specific version numbers) needed for reproducibility.
Experiment Setup | Yes | Throughout the experiments, we use the official Stable Diffusion v1.4 checkpoint with the DDPM sampling scheme of 50 steps (for clarity purposes, we denote timesteps out of T = 1000). To generate low-resolution segmentation maps, we extract feature maps at timestep t_f = 1 (minimum noise). We apply modulation to the third cross-attention layer of the 16 × 16 upward blocks at timestep t_m = 281 with λ = 10. (See also Section D, 'HYPERPARAMETER ANALYSIS'.)
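The quoted setup amounts to a small set of hyperparameters on top of a standard Stable Diffusion v1.4 pipeline. The sketch below is not the authors' implementation; it only collects those hyperparameters into a runnable configuration, assuming the Hugging Face diffusers library. The model ID, variable names, and config dictionary are illustrative choices, and the paper-specific cross-attention modulation itself is not implemented here.

```python
# Minimal configuration sketch, assuming the Hugging Face diffusers library.
# It mirrors only the hyperparameters quoted above; the paper's cross-attention
# modulation and mask-generation procedure are not reproduced here.
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # official Stable Diffusion v1.4 checkpoint
    torch_dtype=torch.float16,
)
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)  # DDPM sampling

emerdiff_config = {
    "num_sampling_steps": 50,   # 50-step DDPM sampling schedule
    "T": 1000,                  # timesteps reported on the 0..1000 scale
    "t_feature": 1,             # t_f: feature extraction at minimum noise
    "t_modulation": 281,        # t_m: timestep at which modulation is applied
    "lambda_modulation": 10,    # λ: modulation strength
    # layer targeted by the modulation, as described in the quoted setup
    "modulated_layer": "third cross-attention layer of the 16x16 upward blocks",
}
```

Reproducing the reported segmentation masks would additionally require the feature-extraction and modulation steps described in the paper.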