JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
Authors: Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David Neil McKinnon, Yanghai Tsin, Long Quan, Yao Yao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "5 EXPERIMENTS"; "Quantitatively, we compare the Fréchet inception distance (FID) (Heusel et al., 2017), inception score (IS) (Salimans et al., 2016) and CLIP similarity (Radford et al., 2021) of the generated RGB over a prompt collection of size 30K sampled from the MSCOCO (Lin et al., 2014) validation set. The results are listed in Tab. 1." |
| Researcher Affiliation | Collaboration | 1Apple, 2The Hong Kong University of Science and Technology, 3Nanjing University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We perform training on the COYO-700M dataset (Byeon et al., 2022) containing image-caption pairs as well as various metadata including properties like image size and derived evaluations such as CLIP (Radford et al., 2021) similarity, watermark score and aesthetic score (Schuhmann, 2022). |
| Dataset Splits | Yes | "Quantitatively, we compare the Fréchet inception distance (FID) (Heusel et al., 2017), inception score (IS) (Salimans et al., 2016) and CLIP similarity (Radford et al., 2021) of the generated RGB over a prompt collection of size 30K sampled from the MSCOCO (Lin et al., 2014) validation set." and "We prepare 100 text prompts sampled from COCO validation set and let user choose the overall best generated RGBD image..." |
| Hardware Specification | Yes | We train the network on 64 NVidia A100 80G GPUs for around 24 hours. |
| Software Dependencies | No | The paper mentions specific models and tools used (e.g., MiDaS v2, Omnidata, Stable Diffusion, Deepfloyd-IF) but does not provide version numbers for underlying software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The sample resolution is 512x512 and the batch size is 4 on each GPU. The model is trained with a learning rate of 1e-4 for 10000 steps with 1000 warmup steps. We adopt a probability of 15% to drop the text conditioning (Ho & Salimans, 2022) and apply noise offset (Lin et al., 2023) of 0.05. The original parameters in the RGB branch are frozen. |
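The two training tricks quoted in the setup row — dropping the text conditioning with 15% probability (Ho & Salimans, 2022) and adding a noise offset of 0.05 (Lin et al., 2023) — can be sketched as a small PyTorch helper. This is an illustrative reconstruction, not the authors' released code; the function and tensor names (`training_noise_and_cond`, `null_emb`) are assumptions.

```python
import torch

def training_noise_and_cond(latents, text_emb, null_emb,
                            drop_prob=0.15, noise_offset=0.05):
    """Sketch of the conditioning dropout and noise offset from the paper's
    training setup (hypothetical helper, not the authors' implementation).

    - With probability `drop_prob`, replace a sample's text embedding with a
      null embedding, enabling classifier-free guidance at inference.
    - Add a per-sample, per-channel constant to the sampled noise (the
      "noise offset"), which helps the model generate very dark or bright images.
    """
    b = latents.shape[0]
    # Per-sample mask: True means "drop the text conditioning for this sample"
    drop = torch.rand(b) < drop_prob
    cond = torch.where(drop[:, None, None],
                       null_emb.expand_as(text_emb), text_emb)
    # Standard Gaussian noise plus a low-frequency per-channel offset
    noise = torch.randn_like(latents)
    noise = noise + noise_offset * torch.randn(b, latents.shape[1], 1, 1)
    return cond, noise
```

With the reported batch size of 4 per GPU across 64 GPUs, `latents` would be the VAE-encoded 512x512 samples and `text_emb` the frozen text-encoder output; only the new (non-RGB) branch parameters receive gradients.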