Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
Authors: Huayu Chen, Hang Su, Peize Sun, Jun Zhu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (~1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. |
| Researcher Affiliation | Collaboration | Huayu Chen¹, Hang Su¹, Peize Sun², Jun Zhu¹·³. ¹Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; ²The University of Hong Kong; ³Shengshu Technology, Beijing |
| Pseudocode | Yes | Pseudocode is provided in Appendix D. |
| Open Source Code | Yes | Code and models: https://github.com/thu-ml/CCA. We submit our source code in the supplementary material. Code and model weights are publicly accessible. |
| Open Datasets | Yes | Though both are class-conditioned models pretrained on ImageNet, LlamaGen and VAR feature distinctively different tokenizer and architecture designs. We leverage CCA to finetune multiple LlamaGen and VAR models of various sizes on the standard ImageNet dataset. |
| Dataset Splits | Yes | We leverage CCA to finetune multiple LlamaGen and VAR models of various sizes on the standard ImageNet dataset. |
| Hardware Specification | Yes | We use a mix of NVIDIA H100, NVIDIA A100, and NVIDIA A40 GPU cards for training. |
| Software Dependencies | No | The paper does not explicitly mention specific version numbers for software libraries or dependencies. It refers to specific models (Llama Gen and VAR) but not the underlying software stack with versions. |
| Experiment Setup | Yes | The training scheme and hyperparameters are mostly consistent with the pretraining phase. We report performance numbers after only one training epoch and find this to be sufficient for ideal performance. We fix β = 0.02 in Eq. 12 and select a suitable λ for each model. Image resolutions are 384×384 for LlamaGen and 256×256 for VAR. Following the original work, we resize LlamaGen samples to 256×256 whenever required for evaluation. Table 4 reports hyperparameters for the chosen models in Figure 1 and Figure 6. All models are fine-tuned for 1 epoch on the ImageNet dataset with batch size 256 and a learning rate of 1e-5 or 2e-5 (see the illustrative sketch below this table). |
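To make the reported setup concrete, the following is a minimal, hypothetical PyTorch sketch of a CCA-style fine-tuning step. The model interface (`model(tokens, cond) -> logits`), the construction of mismatched pairs by shuffling conditions within a batch, and the exact loss form are illustrative assumptions, not the authors' released code; the actual objective (Eq. 12) and pseudocode are given in the paper's Appendix D.

```python
# Minimal, hypothetical sketch of a CCA-style fine-tuning step (PyTorch).
# The model/reference-model interface, the in-batch negative construction, and
# the loss form are assumptions for illustration, not the authors' released code.
import torch
import torch.nn.functional as F

BETA = 0.02   # beta value reported for Eq. 12 in the paper
LAMBDA = 1.0  # lambda is selected per model in the paper; this value is illustrative


def sequence_logprob(model, tokens, cond):
    """Sum of per-token log-probabilities of an AR model for `tokens` given `cond`.
    The teacher-forcing shift is omitted here for brevity."""
    logits = model(tokens, cond)                                    # (B, T, V), assumed interface
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return token_logp.sum(dim=-1)                                   # (B,)


def cca_step(model, ref_model, optimizer, tokens, cond):
    """One training step contrasting matched (tokens, cond) pairs against
    mismatched pairs built by shuffling the conditions within the batch."""
    neg_cond = cond[torch.randperm(cond.size(0), device=cond.device)]
    with torch.no_grad():  # frozen pretrained reference model
        ref_pos = sequence_logprob(ref_model, tokens, cond)
        ref_neg = sequence_logprob(ref_model, tokens, neg_cond)
    pos = sequence_logprob(model, tokens, cond) - ref_pos
    neg = sequence_logprob(model, tokens, neg_cond) - ref_neg
    # Raise the relative likelihood of matched pairs, lower it for mismatched ones.
    loss = -(F.logsigmoid(BETA * pos).mean()
             + LAMBDA * F.logsigmoid(-BETA * neg).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Consistent with the reported setup, such a step would be run for a single epoch over ImageNet tokens with batch size 256 and a learning rate of 1e-5 or 2e-5 (e.g. with AdamW), with λ tuned per model as described in Table 4 of the paper.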