Locally Hierarchical Auto-Regressive Modeling for Image Generation

Authors: Tackgeun You, Saehoon Kim, Chiheon Kim, Doyup Lee, Bohyung Han

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our models, referred to as HQ-TVAE hereafter, on class- and text-conditional image generation tasks.
Researcher Affiliation | Collaboration | Tackgeun You (3,5) tackgeun.you@postech.ac.kr; Saehoon Kim (4) shkim@kakaobrain.com; Chiheon Kim (4) chiheon.kim@kakaobrain.com; Doyup Lee (4) doyup.lee@kakaobrain.com; Bohyung Han (1,2,3) bhhan@snu.ac.kr. Affiliations: 1 ECE, 2 IPAI, 3 AIIS, Seoul National University, Korea; 4 Kakao Brain, Korea; 5 CSE, POSTECH, Korea.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We report in both the paper and the supplementary material.
Open Datasets | Yes | We train our models on 1.2M images in the train split of ImageNet [29] for class-conditional image generation. For text-conditional tasks, we employ 15M image-text pairs in Conceptual Captions (CC) [30] and Conceptual 12M [31].
Dataset Splits | Yes | We train our models on 1.2M images in the train split of ImageNet [29] for class-conditional image generation. For text-conditional tasks, we employ 15M image-text pairs in Conceptual Captions (CC) [30] and Conceptual 12M [31]. ... Table 2: Text-conditional image generation performance on the CC3M validation set. ... Table 3: Comparison of image reconstruction quality on the ImageNet validation set.
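A minimal sketch of the split usage described in these two rows, assuming torchvision's ImageNet loader; the paper does not specify its data pipeline, and paths, crop sizes, and the CC3M/CC12M handling below are our assumptions:

```python
# Hypothetical data setup mirroring the quoted splits: ImageNet train for
# class-conditional training, ImageNet val for reconstruction evaluation.
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),        # assumed 256x256 inputs (not stated here)
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

imagenet_train = datasets.ImageNet(
    root="/data/imagenet", split="train", transform=preprocess
)
imagenet_val = datasets.ImageNet(
    root="/data/imagenet", split="val", transform=preprocess
)
# The 15M CC3M/CC12M image-text pairs would be loaded analogously from
# their released manifests; no loader is specified in the paper.
```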
Hardware Specification | Yes | We measure the throughput of sample generation on a single Tesla A100 GPU.
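For context, a hedged sketch of how such a single-GPU throughput number is typically measured; the `model.sample` API below is hypothetical, since the paper only names the GPU:

```python
import time
import torch

@torch.no_grad()
def images_per_second(model, batch_size=32, n_batches=10, device="cuda"):
    """Time end-to-end sample generation on a single GPU."""
    model = model.eval().to(device)
    torch.cuda.synchronize()           # drain pending kernels before timing
    start = time.time()
    for _ in range(n_batches):
        model.sample(batch_size)       # hypothetical generation API
    torch.cuda.synchronize()           # wait for the last batch to finish
    return batch_size * n_batches / (time.time() - start)
```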
Software Dependencies | No | The paper describes network structures and normalization techniques but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | The scaling factor in HQ-VAE is set to 2, i.e., r = 2, to produce the top and bottom codes of 8×8 and 16×16 resolution by default, respectively. We test three versions of HQ-Transformer by varying hyperparameters: (a) N_MT = 12 and N_PHT = 4 for the smallest model, (b) N_MT = 24 and N_PHT = 4 for the mid-size model, and (c) N_MT = 42 and N_PHT = 6 for our largest model. In all cases, N_IET = 1 and other parameters related to the network size are fixed.
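These three configurations can be tabulated directly. Below is a minimal sketch using a hypothetical config dataclass; the field names are ours, while the depths and code resolutions come from the quote above:

```python
from dataclasses import dataclass

@dataclass
class HQTransformerConfig:
    # Depth hyperparameters in the paper's notation; other size-related
    # parameters are fixed across all three models.
    n_mt: int             # N_MT
    n_pht: int            # N_PHT
    n_iet: int = 1        # N_IET = 1 in all cases
    top_res: int = 8      # top code resolution with r = 2 (8x8)
    bottom_res: int = 16  # bottom code resolution (16x16 = r * 8)

small = HQTransformerConfig(n_mt=12, n_pht=4)    # (a) smallest model
medium = HQTransformerConfig(n_mt=24, n_pht=4)   # (b) mid-size model
large = HQTransformerConfig(n_mt=42, n_pht=6)    # (c) largest model
```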