Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion

Authors: Xuantong Liu, Tianyang Hu, Wenjia Wang, Kenji Kawaguchi, Yuan Yao

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our method on T2I-CompBench (Huang et al., 2023), a comprehensive benchmark for open-world compositional text-to-image generation consisting of attribute binding, object relationships, and complex compositions. The results are presented in Table 3."
Researcher Affiliation | Collaboration | "1) The Hong Kong University of Science and Technology; 2) Huawei Noah's Ark Lab; 3) Hong Kong University of Science and Technology (Guangzhou); 4) National University of Singapore."
Pseudocode | Yes | "The overall algorithm is depicted in Algorithm 1." (Algorithm 1: Conditional Generation via VLM Inversion and SDS)
Open Source Code | Yes | "The code is available at https://github.com/Pepperlll/VLMinv."
Open Datasets | Yes | "We evaluate our method on T2I-CompBench (Huang et al., 2023), a comprehensive benchmark for open-world compositional text-to-image generation... We generated 10K images based on the captions randomly selected from the COCO dataset and calculated the FID score." (See the FID sketch after the table.)
Dataset Splits | No | The paper evaluates on T2I-CompBench and reports ablation studies, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) or refer to predefined splits from standard benchmarks for reproducibility.
Hardware Specification | Yes | "All experiments are conducted on Tesla V100 GPUs."
Software Dependencies | No | The paper names pre-trained models (BLIP-2, Stable Diffusion) and optimizers (Adam, SGD) but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | "We optimize the randomly initialized z for 160 iterations. In the first 150 iterations, z is first updated using gradients provided by L_align backpropagation via an Adam optimizer. Subsequently, it is further refined using gradients provided by SDS through an SGD optimizer without momentum. The norm of the gradient from BLIP-2 is always kept at twice that of the gradient from SDS (w1 = 2); a noise ϵ_z ∼ N(0, 0.2²) is added to z. In the last 10 iterations, z is updated solely based on the gradients from SDS. We initialize the learning rate at 1.0, which then gradually diminishes to 0.5 following a cosine decay schedule. The SDS weight (w2) decays from 800 to 400, also following a cosine schedule. We also implement EMA-restart at 40 and 100 iterations." (A sketch of this schedule follows the table.)
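
Below is a minimal PyTorch sketch of the optimization schedule quoted in the Experiment Setup row. It is not the authors' implementation: align_loss and sds_grad are hypothetical stand-ins for the BLIP-2 alignment loss and the Stable Diffusion SDS gradient, and the gradient-norm rescaling, the EMA decay rate, and the exact EMA-restart mechanics are assumptions where the paper's text leaves them unstated.

import math
import torch

def cosine_decay(start, end, step, total):
    # Cosine schedule from start down to end over total steps.
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * step / total))

def align_loss(z):
    # Hypothetical stand-in for the BLIP-2 alignment loss L_align.
    return (z ** 2).mean()

def sds_grad(z, w2):
    # Hypothetical stand-in for the w2-weighted SDS gradient from
    # Stable Diffusion; returns a tensor shaped like z.
    return w2 * 1e-3 * torch.randn_like(z)

z = torch.randn(1, 4, 64, 64, requires_grad=True)  # randomly initialized z
adam = torch.optim.Adam([z], lr=1.0)
sgd = torch.optim.SGD([z], lr=1.0, momentum=0.0)   # SGD without momentum
ema = z.detach().clone()

TOTAL, ALIGN_STEPS, W1 = 160, 150, 2.0

for step in range(TOTAL):
    lr = cosine_decay(1.0, 0.5, step, TOTAL)      # learning rate: 1.0 -> 0.5
    w2 = cosine_decay(800.0, 400.0, step, TOTAL)  # SDS weight: 800 -> 400
    for group in adam.param_groups + sgd.param_groups:
        group["lr"] = lr

    if step < ALIGN_STEPS:
        # Adam step on the alignment loss, with the BLIP-2 gradient
        # rescaled so its norm is W1 = 2 times the SDS gradient norm.
        adam.zero_grad()
        align_loss(z).backward()
        g_sds = sds_grad(z, w2)
        z.grad.mul_(W1 * g_sds.norm() / (z.grad.norm() + 1e-8))
        adam.step()
        # Momentum-free SGD step directly along the SDS gradient.
        sgd.zero_grad()
        z.grad = g_sds
        sgd.step()
        # Inject noise eps_z ~ N(0, 0.2^2) into z.
        with torch.no_grad():
            z.add_(0.2 * torch.randn_like(z))
    else:
        # Final 10 iterations: update z from the SDS gradient only.
        sgd.zero_grad()
        z.grad = sds_grad(z, w2)
        sgd.step()

    with torch.no_grad():
        # EMA-restart (assumed form): track an EMA of z and reset z
        # to it at iterations 40 and 100.
        ema.mul_(0.99).add_(z, alpha=0.01)
        if step in (40, 100):
            z.copy_(ema)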
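
The FID number referenced in the Open Datasets row (10K images generated from randomly selected COCO captions) could be reproduced along the following lines. This is a sketch assuming the torchmetrics implementation of FID (which needs the torch-fidelity extra installed); the paper does not state which FID implementation was used, and the dummy tensors stand in for the real reference and generated image batches.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Inception-v3 pool3 features (dimension 2048), the standard FID setting.
fid = FrechetInceptionDistance(feature=2048)

# Dummy uint8 batches of shape (N, 3, H, W); in practice, iterate over
# the 10K reference COCO images and the 10K generated images in batches.
real_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
generated_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_batch, real=True)        # accumulate reference statistics
fid.update(generated_batch, real=False)  # accumulate generated statistics
print(f"FID: {fid.compute().item():.2f}")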