Muse: Text-To-Image Generation via Masked Generative Transformers

Authors: Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance... Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32.
Researcher Affiliation | Industry | Google Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | "More results and videos demonstrating editing are available at http://muse-icml.github.io" and "Due to these important considerations, we opt to not release code or a public demo at this point in time."
Open Datasets | Yes | "We train on the Imagen dataset, consisting of 860M text-image pairs (Saharia et al., 2022)." and "In Table 1 and Table 2, we show our performance against other methods on the CC3M (Sharma et al., 2018) and COCO (Lin et al., 2014) datasets."
Dataset Splits | No | The paper mentions using established datasets such as Imagen, CC3M, and COCO for training and evaluation, but it does not specify how these datasets were split into training, validation, and test sets (e.g., exact percentages or sample counts per split, or references to predefined split files).
Hardware Specification | Yes | Each image was generated in 1.4s on a TPUv4 chip.
Software Dependencies | No | The paper mentions optimizers such as Adafactor (Shazeer & Stern, 2018) and Adam (Kingma & Ba, 2015) and specific learning-rate schedules (cosine decay), but it does not list software dependencies such as programming languages, libraries, or frameworks with specific version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | Training is performed for 1M steps, with a batch size of 512 on 512-core TPU-v4 chips (Jouppi et al., 2020). (See the configuration sketch below the table.)
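
The Software Dependencies and Experiment Setup rows above report Adafactor/Adam optimizers with a cosine learning-rate decay, 1M training steps, and a batch size of 512, but no framework or version information. The listing below is a minimal sketch of such a configuration using optax, a JAX optimizer library; the library choice, the peak learning rate, and the warmup length are assumptions for illustration and are not taken from the paper.

    # Minimal sketch of the training configuration reported above, using optax
    # (a JAX optimizer library). Only the optimizer names, the cosine decay,
    # the 1M steps, and the batch size of 512 come from the paper; everything
    # else is a placeholder assumption.
    import jax.numpy as jnp
    import optax

    TRAIN_STEPS = 1_000_000   # "Training is performed for 1M steps"
    BATCH_SIZE = 512          # "with a batch size of 512"
    PEAK_LR = 1e-4            # placeholder value, not reported in the excerpts above
    WARMUP_STEPS = 10_000     # placeholder value, not reported in the excerpts above

    # Cosine decay of the learning rate over the full training run.
    lr_schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0,
        peak_value=PEAK_LR,
        warmup_steps=WARMUP_STEPS,
        decay_steps=TRAIN_STEPS,
        end_value=0.0,
    )

    # The paper cites Adafactor (Shazeer & Stern, 2018) and Adam (Kingma & Ba, 2015);
    # either can be paired with the schedule above.
    optimizer = optax.adafactor(learning_rate=lr_schedule)
    # optimizer = optax.adam(learning_rate=lr_schedule)  # alternative mentioned in the paper

    # Placeholder parameters, only to show the optimizer state being initialized.
    params = {"w": jnp.zeros((4, 4))}
    opt_state = optimizer.init(params)

This sketch builds only the optimizer and schedule; model definition, data loading, and the TPU-v4 sharding mentioned in the paper are out of scope here.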