Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing
Authors: Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our CONTEXTDIFF achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations. |
| Researcher Affiliation | Academia | Ling Yang¹, Zhilong Zhang¹, Zhaochen Yu¹, Jingwei Liu¹, Minkai Xu², Stefano Ermon², Bin Cui¹ — ¹Peking University, ²Stanford University |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/YangLing0818/ContextDiff |
| Open Datasets | Yes | Following Rombach et al. (2022); Saharia et al. (2022b), we use public LAION-400M (Schuhmann et al., 2021), a dataset with CLIP-filtered 400 million image-text pairs for training CONTEXTDIFF. [...] we use 42 representative videos taken from DAVIS dataset (Pont-Tuset et al., 2017) |
| Dataset Splits | Yes | Following previous works (Rombach et al., 2022; Ramesh et al., 2022; Saharia et al., 2022b), we make quantitative evaluations of CONTEXTDIFF on the MSCOCO dataset using zero-shot FID score, which measures the quality and diversity of generated images. Similar to Rombach et al. (2022); Ramesh et al. (2022); Saharia et al. (2022b), 30,000 images are randomly selected from the validation set for evaluation. (See the evaluation sketch after this table.) |
| Hardware Specification | No | The paper mentions computational costs and efficiency in Table 5 but does not specify the exact hardware (e.g., specific GPU/CPU models) used for running its experiments. |
| Software Dependencies | No | The paper mentions various models and optimizers (e.g., CLIP, U-Net, AdamW) and references other methods' repositories, but it does not provide specific version numbers for its own software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We provide detailed hyper-parameters in training CONTEXTDIFF for text-to-image generation (in Tab. 7) and text-to-video editing (in Tab. 8). Table 7 (text-to-image): T = 1000; noise schedule: cosine; transformer blocks for cross-modal interactions: 4; AdamW betas: (0.9, 0.999); weight decay: 0.0; learning rate: 1e-4; linear warmup steps: 10000; batch size: 1024. Table 8 (text-to-video editing): T = 20; noise schedule: linear; transformer blocks for cross-modal interactions: 4; AdamW betas: (0.9, 0.999); weight decay: 1e-2; learning rate: 1e-5; warmup steps: 0; use checkpoint: True; batch size: 1; number of frames: 8 24; sampling rate: 2. (A configuration sketch follows this table.) |
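
Below is a minimal sketch of how the Table 7 text-to-image optimizer settings could be wired up in PyTorch, assuming an AdamW optimizer with a linear warmup schedule. The `model` placeholder and the post-warmup learning-rate behaviour are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder for the CONTEXTDIFF denoising network (hypothetical; not the authors' model).
model = torch.nn.Linear(512, 512)

# Table 7 (text-to-image) optimizer settings as reported in the paper.
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,             # learning rate
    betas=(0.9, 0.999),  # AdamW betas
    weight_decay=0.0,    # weight decay
)

# Linear warmup over 10,000 steps, then a constant learning rate
# (the post-warmup behaviour is an assumption; the paper only reports "linear warmup steps 10000").
warmup_steps = 10_000
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

# Diffusion settings from Table 7: T = 1000 timesteps with a cosine noise schedule, batch size 1024.
T, batch_size = 1000, 1024
```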
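
The zero-shot FID protocol quoted under Dataset Splits (30,000 images randomly drawn from the MS-COCO validation set) could be reproduced along the lines of the sketch below. `generate_image`, the `coco_val` pair list, and the use of torchmetrics' `FrechetInceptionDistance` are assumptions for illustration, not the paper's evaluation code.

```python
import random
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def generate_image(caption: str) -> torch.Tensor:
    """Hypothetical stand-in for the trained text-to-image model; returns a (3, 256, 256) uint8 image."""
    return torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)

def zero_shot_fid(coco_val, num_samples=30_000, seed=0):
    """coco_val is assumed to be a list of (uint8 image tensor, caption) pairs from the MS-COCO validation set."""
    random.seed(seed)
    subset = random.sample(coco_val, num_samples)  # 30k randomly selected validation examples
    fid = FrechetInceptionDistance(feature=2048)   # Inception-V3 pooled features
    for real_img, caption in subset:
        fid.update(real_img.unsqueeze(0), real=True)                  # reference distribution
        fid.update(generate_image(caption).unsqueeze(0), real=False)  # generated distribution
    return fid.compute()
```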