JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

Authors: Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David Neil McKinnon, Yanghai Tsin, Long Quan, Yao Yao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS"; "Quantitatively, we compare the Fréchet inception distance (FID) (Heusel et al., 2017), inception score (IS) (Salimans et al., 2016) and CLIP similarity (Radford et al., 2021) of the generated RGB over a prompt collection of size 30K sampled from the MSCOCO (Lin et al., 2014) validation set. The results are listed in Tab. 1."
Researcher Affiliation | Collaboration | 1Apple, 2The Hong Kong University of Science and Technology, 3Nanjing University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | "We perform training on the COYO-700M dataset (Byeon et al., 2022) containing image-caption pairs as well as various metadata including properties like image size and derived evaluations such as CLIP (Radford et al., 2021) similarity, watermark score and aesthetic score (Schuhmann, 2022)."
Dataset Splits | Yes | "Quantitatively, we compare the Fréchet inception distance (FID) (Heusel et al., 2017), inception score (IS) (Salimans et al., 2016) and CLIP similarity (Radford et al., 2021) of the generated RGB over a prompt collection of size 30K sampled from the MSCOCO (Lin et al., 2014) validation set." and "We prepare 100 text prompts sampled from COCO validation set and let user choose the overall best generated RGBD image..."
Hardware Specification | Yes | "We train the network on 64 NVidia A100 80G GPUs for around 24 hours."
Software Dependencies | No | The paper mentions specific models and tools used (e.g., MiDaS v2, Omnidata, Stable Diffusion, DeepFloyd-IF) but does not provide version numbers for underlying software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | "The sample resolution is 512x512 and the batch size is 4 on each GPU. The model is trained with a learning rate of 1e-4 for 10000 steps with 1000 warmup steps. We adopt a probability of 15% to drop the text conditioning (Ho & Salimans, 2022) and apply noise offset (Lin et al., 2023) of 0.05. The original parameters in the RGB branch are frozen."
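The Experiment Setup row reports two standard diffusion-training tricks: dropping the text conditioning with 15% probability (for classifier-free guidance, Ho & Salimans, 2022) and adding a noise offset of 0.05 (Lin et al., 2023). The paper releases no code, so the sketch below is a hypothetical PyTorch illustration of those two tricks only; the function names and tensor shapes are illustrative, not the authors' implementation.

```python
import torch


def training_noise(latents: torch.Tensor, noise_offset: float = 0.05) -> torch.Tensor:
    """Gaussian noise plus a per-sample, per-channel mean offset (Lin et al., 2023)."""
    noise = torch.randn_like(latents)
    # The offset shifts each channel's mean, which helps the model learn
    # very dark or very bright images; 0.05 matches the reported setting.
    noise = noise + noise_offset * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1, device=latents.device
    )
    return noise


def maybe_drop_text(text_emb: torch.Tensor, null_emb: torch.Tensor,
                    p_drop: float = 0.15) -> torch.Tensor:
    """Replace text embeddings with the null embedding with probability p_drop.

    Dropping conditioning during training enables classifier-free guidance
    at sampling time (Ho & Salimans, 2022); 15% matches the reported setting.
    """
    keep = torch.rand(text_emb.shape[0], device=text_emb.device) >= p_drop
    mask = keep.view(-1, *([1] * (text_emb.dim() - 1))).to(text_emb.dtype)
    return mask * text_emb + (1 - mask) * null_emb
```

In a training loop these would be applied per batch before the denoiser forward pass, with the RGB-branch parameters frozen as the paper states.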
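The evaluation rows cite the Fréchet inception distance (Heusel et al., 2017), which compares Gaussians fitted to Inception-v3 features of real and generated images: FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)). A minimal NumPy/SciPy sketch of that formula (the feature extraction itself is omitted; `frechet_distance` is a hypothetical helper name):

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1: np.ndarray, sigma1: np.ndarray,
                     mu2: np.ndarray, sigma2: np.ndarray) -> float:
    """Fréchet distance between two Gaussians (mu1, sigma1) and (mu2, sigma2).

    For FID, mu/sigma are the mean and covariance of Inception-v3 pool
    features computed over the real and generated image sets.
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical statistics give a distance of 0; the 30K-prompt MSCOCO protocol in the paper would supply the two feature sets being compared.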