Reject Decoding via Language-Vision Models for Text-to-Image Synthesis

Authors: Fuxiang Wu, Liu Liu, Fusheng Hao, Fengxiang He, Lei Wang, Jun Cheng

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments on the MS-COCO dataset and on large-scale datasets show that the proposed reject decoding algorithm excludes useless paths and enlarges the set of searched paths, improving synthesis quality while consuming less time. The authors conduct extensive experiments with a base model trained on MS-COCO and a large-scale model trained on large-scale datasets to verify the efficiency of the reject decoding algorithm and the effectiveness of the multimodal vision models.
Researcher Affiliation | Collaboration | Fuxiang Wu (1,2), Liu Liu (3), Fusheng Hao (1,2), Fengxiang He (4), Lei Wang (1,2), Jun Cheng (1,2). 1: Guangdong Provincial Key Laboratory of Robotics and Intelligent System, Shenzhen Institute of Advanced Technology, CAS, China; 2: The Chinese University of Hong Kong, Hong Kong, China; 3: School of Computer Science, Faculty of Engineering, The University of Sydney, Australia; 4: JD Explore Academy, JD.com Inc., Beijing, China
Pseudocode | Yes | Algorithm 1: Original Decoding in Transformer; Algorithm 2: Reject Decoding in Transformer; Algorithm 3: Searching Reject Threshold. (A hedged sketch of a reject-decoding step is given after the table.)
Open Source Code | No | The paper provides neither a link to nor an explicit statement about the availability of the authors' source code for the proposed method.
Open Datasets | Yes | Extensive experiments are conducted with a base model trained on the MS-COCO dataset (Lin et al. 2014) as the normal model, denoted by the superscript coco. For the large-scale setting, the authors use the pre-trained RQ-Transformer with 3.9B parameters (github.com/kakaobrain/rq-vae-transformer), denoted by the superscript pre, which is trained on CC-3M (github.com/google-research-datasets/conceptual-captions), CC-12M (github.com/google-research-datasets/conceptual-12m), and a YFCC subset (github.com/openai/CLIP/blob/main/data/yfcc100m.md).
Dataset Splits | No | The paper mentions training on these datasets but does not detail how the data was split into training, validation, and test sets, nor does it explicitly mention a validation set.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU model, CPU type, memory) used for the experiments; it only refers to 'the same device' when reporting runtimes.
Software Dependencies | No | The paper mentions using GPT-2 and the RQ-Transformer but does not list any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers).
Experiment Setup | Yes | In Algorithm 2, the group size is M = 8 and the total number of tokens is 64, so 8 similarity models {M_i}, i = 1, ..., 8, are constructed. Figure 4 shows the influence of Ne with Nb = 20. The multimodal vision models consist of 8 layers with 4 heads. (See the configuration sketch after the table.)
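
To make the pseudocode row concrete, the following is a minimal sketch of what one reject-decoding step might look like, based only on the paper's high-level description: candidates whose score under a multimodal similarity model falls below a reject threshold are pruned before the search ranks the survivors. All names here (reject_decoding_step, reject_threshold, the fallback behavior) are illustrative assumptions, not the authors' implementation.

```python
import torch

def reject_decoding_step(logits, similarity_scores, reject_threshold, beam_width):
    """Hedged sketch of one reject-decoding step (not the authors' code).

    logits:            (num_candidates,) scores from the autoregressive transformer
    similarity_scores: (num_candidates,) scores from a multimodal similarity model
    reject_threshold:  candidates below this are rejected (Algorithm 3 in the
                       paper searches for this value; here it is assumed given)
    beam_width:        number of surviving candidates to keep
    """
    # Reject candidates the similarity model deems unpromising...
    keep = similarity_scores >= reject_threshold
    if not keep.any():
        # Assumed fallback: if everything is rejected, revert to plain top-k.
        keep = torch.ones_like(keep)
    # ...then rank the survivors by the transformer's own scores.
    masked_logits = logits.masked_fill(~keep, float("-inf"))
    k = min(beam_width, int(keep.sum()))
    top = torch.topk(masked_logits, k)
    return top.indices, top.values

# Toy usage with random scores.
torch.manual_seed(0)
indices, values = reject_decoding_step(
    logits=torch.randn(16),
    similarity_scores=torch.rand(16),
    reject_threshold=0.5,
    beam_width=4,
)
print(indices, values)
```

The intended point of the sketch is the pruning order: the similarity model filters first, so the transformer's top-k runs over fewer paths, which is consistent with the paper's claim of excluding useless paths while consuming less time.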
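
Similarly, the experiment-setup row reports 8-layer, 4-head multimodal vision models and a group size of M = 8 over 64 tokens, i.e., one similarity model per token group. Below is a minimal configuration sketch under those reported numbers, assuming a standard transformer encoder; the hidden size (d_model=256), vocabulary size, and the scalar scoring head are our guesses, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    """Hedged sketch of one similarity model M_i: 8 layers, 4 heads as reported;
    d_model, vocab_size, and the scoring head are assumptions."""

    def __init__(self, vocab_size=16384, d_model=256, n_layers=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(d_model, 1)  # scalar match score per sequence

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.score(h.mean(dim=1)).squeeze(-1)

# One model per token group: group size M = 8 over 64 tokens -> 8 models.
models = nn.ModuleList(SimilarityModel() for _ in range(8))
scores = models[0](torch.randint(0, 16384, (2, 8)))  # batch of 2, one 8-token group
print(scores.shape)  # torch.Size([2])
```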