Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

Authors: Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our CONTEXTDIFF achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations.
Researcher Affiliation | Academia | Ling Yang¹, Zhilong Zhang¹, Zhaochen Yu¹, Jingwei Liu¹, Minkai Xu², Stefano Ermon², Bin Cui¹ (¹Peking University, ²Stanford University)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/YangLing0818/ContextDiff
Open Datasets | Yes | Following Rombach et al. (2022); Saharia et al. (2022b), we use public LAION-400M (Schuhmann et al., 2021), a dataset with CLIP-filtered 400 million image-text pairs for training CONTEXTDIFF. [...] we use 42 representative videos taken from the DAVIS dataset (Pont-Tuset et al., 2017)
Dataset Splits | Yes | Following previous works (Rombach et al., 2022; Ramesh et al., 2022; Saharia et al., 2022b), we quantitatively evaluate CONTEXTDIFF on the MSCOCO dataset using the zero-shot FID score, which measures the quality and diversity of generated images. Similar to Rombach et al. (2022), Ramesh et al. (2022), and Saharia et al. (2022b), 30,000 images are randomly selected from the validation set for evaluation. (See the zero-shot FID sketch after the table.)
Hardware Specification | No | The paper mentions computational costs and efficiency in Table 5 but does not specify the exact hardware (e.g., specific GPU/CPU models) used for running its experiments.
Software Dependencies | No | The paper mentions various models and optimizers (e.g., CLIP, U-Net, AdamW) and references other methods' repositories, but it does not provide specific version numbers for its own software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We provide detailed hyper-parameters in training CONTEXTDIFF for text-to-image generation (in Tab. 7) and text-to-video editing (in Tab. 8). Table 7 (text-to-image): T = 1000, noise schedule = cosine, transformer blocks for cross-modal interactions = 4, AdamW betas = (0.9, 0.999), weight decay = 0.0, learning rate = 1e-4, linear warmup steps = 10,000, batch size = 1024. Table 8 (text-to-video): T = 20, noise schedule = linear, transformer blocks for cross-modal interactions = 4, AdamW betas = (0.9, 0.999), weight decay = 1e-2, learning rate = 1e-5, warmup steps = 0, use checkpoint = True, batch size = 1, number of frames = 8/24, sampling rate = 2. (See the configuration sketch after the table.)
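
For readers who want to mirror the zero-shot FID protocol quoted in the Dataset Splits row, the sketch below shows one common way to compute FID over 30,000 MSCOCO validation caption/image pairs using torchmetrics. This is not the paper's evaluation code; `load_mscoco_val` and `generate_image` are placeholder callables standing in for the dataset loader and the trained generator.

```python
# Sketch of a zero-shot FID evaluation in the spirit of the quoted protocol:
# sample 30,000 caption/image pairs from the MSCOCO validation set, generate one
# image per caption, and compare real vs. generated statistics with FID.
# Not the authors' code; `load_mscoco_val` and `generate_image` are placeholders.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance


def evaluate_zero_shot_fid(load_mscoco_val, generate_image,
                           num_samples=30_000, device="cuda"):
    # normalize=True: images are float tensors in [0, 1], shape (3, H, W)
    fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)
    for caption, real_image in load_mscoco_val(num_samples):
        fake_image = generate_image(caption)  # generated sample for the same caption
        fid.update(real_image.unsqueeze(0).to(device), real=True)
        fid.update(fake_image.unsqueeze(0).to(device), real=False)
    return fid.compute().item()
```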
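
As a quick orientation, the hyper-parameters reported in Tables 7 and 8 can be collected into a single configuration object, as sketched below. Only the values come from the paper; the class and field names are illustrative assumptions, not the authors' API.

```python
# Hypothetical config objects collecting the hyper-parameters reported in
# Tables 7 and 8 of the paper. Field names are illustrative, not the authors' API.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ContextDiffTrainConfig:
    timesteps: int                      # diffusion steps T
    noise_schedule: str                 # "cosine" or "linear"
    num_cross_modal_blocks: int         # transformer blocks for cross-modal interactions
    adamw_betas: Tuple[float, float]
    weight_decay: float
    learning_rate: float
    warmup_steps: int
    batch_size: int


# Text-to-image generation (Table 7)
text_to_image = ContextDiffTrainConfig(
    timesteps=1000,
    noise_schedule="cosine",
    num_cross_modal_blocks=4,
    adamw_betas=(0.9, 0.999),
    weight_decay=0.0,
    learning_rate=1e-4,
    warmup_steps=10_000,   # linear warmup
    batch_size=1024,
)

# Text-to-video editing (Table 8); the paper additionally reports
# use checkpoint = True, number of frames = 8/24, sampling rate = 2.
text_to_video = ContextDiffTrainConfig(
    timesteps=20,
    noise_schedule="linear",
    num_cross_modal_blocks=4,
    adamw_betas=(0.9, 0.999),
    weight_decay=1e-2,
    learning_rate=1e-5,
    warmup_steps=0,
    batch_size=1,
)
```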