MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation

Authors: Chuanxia Zheng, Tung-Long Vuong, Jianfei Cai, Dinh Phung

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
Researcher Affiliation | Collaboration | Chuanxia Zheng, Monash University, chuanxiazheng@gmail.com; Long Tung Vuong, VinAI, longvt94@gmail.com; Jianfei Cai, Monash University, Jianfei.Cai@monash.edu; Dinh Phung, Monash University, dinh.phung@monash.edu
Pseudocode | No | The paper describes the model architecture and training process in text and diagrams, but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | No | The authors use an existing open-source implementation as a baseline (VQGAN) and provide a link to it, but they do not explicitly state that their own modified code for MoVQ is open source or provide a link to it. The checklist also states N/A for including the code.
Open Datasets | Yes | Datasets. To evaluate the proposed method, we instantiated MoVQ on both unconditional and class-conditional image generation tasks, with FFHQ [20] and ImageNet [33] respectively.
Dataset Splits | Yes | Quantitative reconstruction results on the validation splits of ImageNet [33] (50,000 images) and FFHQ [20] (10,000 images).
Hardware Specification | Yes | We trained all models with a batch size of 48 across 4 Tesla V100 GPUs with 40 epochs for this stage.
Software Dependencies | No | The paper refers to a baseline implementation (VQGAN via `taming-transformers` GitHub), but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We trained all models with a batch size of 48 across 4 Tesla V100 GPUs with 40 epochs for this stage. For each dataset, we only trained a single-scale quantizer with a codebook Z ∈ R^{1024×64}, i.e. 1024 codevectors each with 64 dimensions, on 256×256 images for all experiments. Our encoder-decoder pipeline is built upon the original VQGAN, except that we replaced the original Group Normalization with the proposed spatially conditional normalization layer. The second-stage transformer uses 24 layers, 16 attention heads, 1024 embedding dimensions, and 4096 hidden dimensions, and was trained with a batch size of 64 across 4 Tesla V100 GPUs for 200 epochs.
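The key architectural change noted under Experiment Setup, replacing the decoder's Group Normalization with a spatially conditional normalization layer driven by the quantized code map, is not released as code by the authors. The following is a minimal PyTorch sketch assuming a SPADE-style modulation of normalized features by the quantized latents; the class name `SpatiallyConditionalNorm`, the 3×3 convolutions, the group count, and the tensor shapes are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of a spatially conditional normalization layer,
# assuming SPADE-style modulation by the quantized code map z_q.
# Names and layer choices are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiallyConditionalNorm(nn.Module):
    """Normalize decoder features, then apply per-pixel scale/shift
    predicted from the quantized latent map z_q."""

    def __init__(self, feat_channels: int, code_channels: int = 64, groups: int = 32):
        super().__init__()
        # Parameter-free GroupNorm; scale and shift come from the modulation branch.
        self.norm = nn.GroupNorm(groups, feat_channels, affine=False)
        self.to_gamma = nn.Conv2d(code_channels, feat_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(code_channels, feat_channels, kernel_size=3, padding=1)

    def forward(self, h: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
        # Resize the quantized code map to this block's feature resolution.
        z_q = F.interpolate(z_q, size=h.shape[-2:], mode="nearest")
        return self.norm(h) * (1 + self.to_gamma(z_q)) + self.to_beta(z_q)


if __name__ == "__main__":
    # Toy shapes: 64-dim codes on a 16x16 grid modulating 256-channel features.
    layer = SpatiallyConditionalNorm(feat_channels=256, code_channels=64)
    h = torch.randn(1, 256, 64, 64)
    z_q = torch.randn(1, 64, 16, 16)
    print(layer(h, z_q).shape)  # torch.Size([1, 256, 64, 64])
```

For reference, the two-stage training configuration reported in the quotes above can be collected into a plain Python dict; the key names are illustrative, the values are taken directly from the paper's description.

```python
# Reported hyperparameters, gathered for quick reference (keys are assumptions).
movq_training_config = {
    "stage1_autoencoder": {
        "codebook_size": 1024,      # codevectors
        "code_dim": 64,             # dimensions per codevector
        "image_resolution": 256,    # 256x256 inputs
        "batch_size": 48,
        "gpus": "4x Tesla V100",
        "epochs": 40,
    },
    "stage2_transformer": {
        "layers": 24,
        "attention_heads": 16,
        "embedding_dim": 1024,
        "hidden_dim": 4096,
        "batch_size": 64,
        "gpus": "4x Tesla V100",
        "epochs": 200,
    },
}
```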