Reject Decoding via Language-Vision Models for Text-to-Image Synthesis
Authors: Fuxiang Wu, Liu Liu, Fusheng Hao, Fengxiang He, Lei Wang, Jun Cheng
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments conducted on the MS-COCO dataset and large-scale datasets show that the proposed reject decoding algorithm can exclude useless paths and enlarge the search paths, improving synthesis quality while consuming less time. We conduct extensive experiments with a base model trained on the MS-COCO dataset and a large-scale model trained on large-scale datasets to verify the efficiency of the reject decoding algorithm and the effectiveness of the multimodal vision models. |
| Researcher Affiliation | Collaboration | Fuxiang Wu (1,2), Liu Liu (3), Fusheng Hao (1,2), Fengxiang He (4), Lei Wang (1,2), Jun Cheng (1,2). Affiliations: 1 Guangdong Provincial Key Laboratory of Robotics and Intelligent System, Shenzhen Institute of Advanced Technology, CAS, China; 2 The Chinese University of Hong Kong, Hong Kong, China; 3 School of Computer Science, Faculty of Engineering, The University of Sydney, Australia; 4 JD Explore Academy, JD.com Inc., Beijing, China |
| Pseudocode | Yes | Algorithm 1: Original Decoding in Transformer; Algorithm 2: Reject Decoding in Transformer; Algorithm 3: Searching Reject Threshold |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of the authors' source code for their proposed methodology. |
| Open Datasets | Yes | We conduct extensive experiments with a base model trained on the MS-COCO dataset (Lin et al. 2014) as the normal model, denoted by the superscript coco. To verify the experiments on large-scale datasets, we exploit the large-scale pre-trained RQ-Transformer with 3.9B parameters [2], denoted by the superscript pre, which is trained on CC-3M [3], CC-12M [4], and the YFCC-subset [5]. [2] github.com/kakaobrain/rq-vae-transformer; [3] github.com/google-research-datasets/conceptual-captions; [4] github.com/google-research-datasets/conceptual-12m; [5] github.com/openai/CLIP/blob/main/data/yfcc100m.md |
| Dataset Splits | No | The paper mentions training on datasets but does not provide specific details on how the data was split into training, validation, and test sets, nor does it explicitly mention a validation set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. It only mentions 'the same device' when comparing time consumption. |
| Software Dependencies | No | The paper mentions using GPT2 and RQ-Transformer but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers with their specific versions). |
| Experiment Setup | Yes | In Algorithm 2, we choose the group size M = 8, and the total size of tokens is 64. Thus, we construct 8 similarity models {M_i}, i = 1, ..., 8. In Figure 4, the results demonstrate the influence of N_e with N_b = 20. The multimodal vision models consist of 8 layers with 4 heads. |
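
The setup row above only lists the reported hyperparameters, so the following is a minimal sketch, assuming a simple score-thresholding interface, of how those values might be gathered into a configuration and how a reject threshold could filter candidate decoding paths. The names `RejectDecodingConfig` and `reject_filter` and the scoring interface are hypothetical; only the numbers (M = 8, 64 tokens, 8 similarity models, 8 layers with 4 heads, N_b = 20) come from the paper excerpts quoted in the table, and this is not the authors' implementation.

```python
# Hypothetical sketch of the reported decoding configuration.
# Only the numeric values are taken from the paper excerpts above;
# the class/function names and the scoring interface are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class RejectDecodingConfig:
    group_size: int = 8              # M: tokens per group in Algorithm 2
    total_tokens: int = 64           # total size of tokens per image
    num_similarity_models: int = 8   # one similarity model M_i per group, i = 1..8
    vision_model_layers: int = 8     # layers in each multimodal vision model
    vision_model_heads: int = 4      # attention heads per layer
    beam_size: int = 20              # N_b used when studying the influence of N_e


def reject_filter(candidate_scores: List[float], threshold: float) -> List[int]:
    """Keep only the candidate paths whose similarity score reaches the reject
    threshold; the surviving indices would then be expanded further, which is
    how useless paths are excluded from the search."""
    return [i for i, score in enumerate(candidate_scores) if score >= threshold]


if __name__ == "__main__":
    cfg = RejectDecodingConfig()
    assert cfg.total_tokens % cfg.group_size == 0  # 64 tokens split into 8 groups
    # Toy scores for illustration only; real scores would come from the
    # multimodal vision (similarity) models scoring partial decoding paths.
    print(reject_filter([0.91, 0.42, 0.77, 0.30], threshold=0.5))  # -> [0, 2]
```

Note that in the paper the reject threshold is itself searched (Algorithm 3: Searching Reject Threshold) rather than fixed, so the constant used in this usage example is purely illustrative.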