Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Authors: Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To investigate these hypotheses, we conduct a comprehensive empirical study on the scaling behavior of autoregressive models in the context of text-to-image generation. Specifically, we explore two key factors: whether the model operates on continuous or discrete tokens, and whether tokens are generated in a random or fixed raster order. To this end, we utilize the Diffusion Loss (Li et al., 2024) to make autoregressive models compatible with continuous tokens. We generalize the BERT-like vision model MaskGIT (Chang et al., 2022) as random-order autoregression, as it conceptually predicts output tokens in a randomized order while retaining the autoregressive nature of predicting next tokens based on known ones. We analyze the behavior of four autoregressive variants, each employing different combinations of these two factors. We scale their parameters from 150M to 3B and evaluate their performance using three metrics: validation loss, FID (Heusel et al., 2017), and GenEval score (Ghosh et al., 2024). We also inspect the visual quality of the generated images.
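The random-order variant described in this row can be sketched as a MaskGIT-style parallel decoding loop: a cosine schedule decides how many tokens remain masked at each step, and the model fills in a random subset of the still-masked positions conditioned on everything decoded so far. The sketch below is illustrative only; `predict_fn` is a hypothetical stand-in for the transformer, and details such as temperature and classifier-free guidance are omitted.

```python
import math
import numpy as np

def random_order_decode(predict_fn, num_tokens=256, num_steps=64):
    """MaskGIT-style random-order decoding sketch.

    predict_fn(tokens, masked) is a hypothetical model call returning
    predicted values for every position; only the masked ones are used.
    """
    tokens = np.zeros(num_tokens)             # placeholder token buffer
    masked = np.ones(num_tokens, dtype=bool)  # all positions start masked
    rng = np.random.default_rng(0)

    for step in range(1, num_steps + 1):
        # Cosine schedule: fraction of tokens still masked after this step.
        frac = math.cos(math.pi / 2 * step / num_steps)
        n_keep_masked = int(frac * num_tokens)
        # How many masked positions to reveal in parallel this step.
        n_reveal = int(masked.sum()) - n_keep_masked
        if n_reveal <= 0:
            continue
        # Reveal a random subset of the still-masked positions.
        idx = rng.choice(np.flatnonzero(masked), size=n_reveal, replace=False)
        tokens[idx] = predict_fn(tokens, masked)[idx]
        masked[idx] = False

    return tokens
```

With 64 steps the schedule reveals few tokens early (when almost nothing is known) and many tokens late, matching the paper's stated 64-step cosine-schedule generation for random-order models.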
Researcher Affiliation Collaboration Lijie Fan1,*, Tianhong Li2, Siyang Qin1, Yuanzhen Li1, Chen Sun1, Michael Rubinstein1, Deqing Sun1, Kaiming He2, Yonglong Tian1,* (1Google DeepMind, 2MIT; * equal contribution, project lead)
Pseudocode No The paper describes the model architecture and training process in detail, including mathematical formulations like p(x1, ..., xn) = ∏_{i=1}^{n} p(xi | x1, ..., x(i-1)) in Section 3, and explains different autoregressive orders in Figure 2. However, it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
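The chain-rule factorization referenced in this row, written out in standard notation:

```latex
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\left(x_i \mid x_1, \dots, x_{i-1}\right)
```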
Open Source Code No Reproducibility Statement. To aid reproducibility, we have provided the implementation details of our framework in Section 4, training hyper-parameters in Section 5, and model configurations in the appendix. For the diffusion loss used for continuous tokens, we have strictly followed the open-sourced code of Li et al. (2024). This statement indicates that details are provided and that they followed code from another paper, but it does not explicitly state that the code for *their* work (Fluid model) is being released or made publicly available.
Open Datasets Yes Dataset. We use a subset of the WebLI (Web Language Image) dataset (Chen et al., 2022) as our training set, which consists of image-text pairs from the web with high scores for both image quality and alt-text relevance. By default, the images are center-cropped and resized to 256×256. ... FID is computed over 30K randomly selected image-text pairs from the MS-COCO 2014 training set, providing a metric that evaluates both the fidelity and diversity of generated images.
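The preprocessing mentioned here (center-crop, then resize to 256×256) can be sketched with NumPy alone. This is a hedged illustration, not the paper's pipeline: a real implementation would use bilinear or bicubic resampling, while this sketch uses nearest-neighbor indexing to stay dependency-free.

```python
import numpy as np

def center_crop_resize(img, size=256):
    """Center-crop an H x W image to a square, then resize to
    size x size via nearest-neighbor index mapping (illustrative)."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = img[top:top + s, left:left + s]
    # Map each output pixel back to its nearest source pixel.
    idx = (np.arange(size) * s / size).astype(int)
    return crop[idx][:, idx]
```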
Dataset Splits No The paper states: "We use a subset of the WebLI (Web Language Image) dataset (Chen et al., 2022) as our training set". It also mentions: "We evaluate the validation loss on 30K images from the MS-COCO 2014 training set, as well as two widely-adopted metrics: zero-shot Fréchet Inception Distance (FID) on MS-COCO". While it specifies that 30K images from the MS-COCO 2014 training set are used for validation and FID, it does not provide explicit train/test/validation splits for the WebLI dataset used for training, nor how the 30K MS-COCO images relate to a split of a larger dataset used for their main training.
Hardware Specification Yes The largest Fluid model, with 10.5B parameters, further improves the zero-shot FID on MS-COCO to 6.16 and increases the GenEval overall score to 0.692, with a speed of 1.571 seconds per image per TPU (evaluated on 32 TPU v5 with a batch size of 2048).
Software Dependencies No The paper mentions several components like "T5-XXL encoder (Raffel et al., 2020)", "SentencePiece (Kudo, 2018)", and "AdamW optimizer (β1 = 0.9, β2 = 0.95) (Loshchilov & Hutter, 2019)". It also states that for diffusion loss, they "strictly followed the open-sourced code of Li et al. (2024)". However, it does not provide specific version numbers for any of these software dependencies or libraries used in their implementation.
Experiment Setup Yes Training. Unless otherwise specified, we use the AdamW optimizer (β1 = 0.9, β2 = 0.95) (Loshchilov & Hutter, 2019) with a weight decay of 0.02 to train each model for 1M steps with a batch size of 2048. This is equivalent to approximately 3 epochs on our dataset. For continuous tokens, we employ a constant learning rate schedule with a 65K-step linear warmup and a maximum learning rate of 1×10⁻⁴; for discrete tokens, we use a cosine learning rate schedule as we find it to be better. For training the random-order models, we randomly sample the masking ratio from [0, 1] following a cosine schedule, similar to MaskGIT (Chang et al., 2022), to mask each image. For all models, an exponential moving average of the weights is gathered with a decay rate of 0.9999 and then used for evaluation. Inference. ... For random-order models, we use 64 steps for generation with a cosine schedule (Chang et al., 2022). To further enhance generation performance, we apply temperature and classifier-free guidance, as is commonly practiced. ... we trained a model with 10.5B parameters and a batch size of 4096 for 1M steps
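The optimizer and schedule details quoted in this row are concrete enough to sketch. Below is a minimal, hypothetical rendering of the constant-learning-rate-with-linear-warmup schedule (65K warmup steps, peak 1e-4) and the EMA weight update (decay 0.9999); the function names are illustrative, not taken from the paper's code.

```python
def learning_rate(step, max_lr=1e-4, warmup_steps=65_000):
    """Constant LR schedule with a linear warmup, as described for
    the continuous-token models."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr

def ema_update(ema_weights, weights, decay=0.9999):
    """One exponential-moving-average step over the model weights;
    the EMA copy is what gets used for evaluation."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```

For example, halfway through warmup (step 32,500) the learning rate is 5e-5, and it stays flat at 1e-4 for the remainder of the 1M-step run.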