MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
Authors: Chuanxia Zheng, Tung-Long Vuong, Jianfei Cai, Dinh Phung
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation. |
| Researcher Affiliation | Collaboration | Chuanxia Zheng (Monash University, chuanxiazheng@gmail.com); Long Tung Vuong (VinAI, longvt94@gmail.com); Jianfei Cai (Monash University, Jianfei.Cai@monash.edu); Dinh Phung (Monash University, dinh.phung@monash.edu) |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams, but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The authors use an existing open-source implementation (VQGAN) as a baseline and link to it, but they do not state that their own modified MoVQ code is released or provide a link to it. The checklist also marks code inclusion as N/A. |
| Open Datasets | Yes | Datasets. To evaluate the proposed method, we instantiated MoVQ on both unconditional and class-conditional image generation tasks, with FFHQ [20] and ImageNet [33] respectively. |
| Dataset Splits | Yes | Quantitative reconstruction results on the validation splits of ImageNet [33] (50,000 images) and FFHQ [20] (10,000 images). |
| Hardware Specification | Yes | We trained all models with a batch size of 48 across 4 Tesla V100 GPUs with 40 epochs for this stage. |
| Software Dependencies | No | The paper refers to a baseline implementation (VQGAN via `taming-transformers` GitHub), but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We trained all models with a batch size of 48 across 4 Tesla V100 GPUs for 40 epochs in this stage. For each dataset, we only trained a single-scale quantizer with a codebook Z ∈ ℝ^{1024×64}, i.e., 1024 code vectors each with 64 dimensions, on 256×256 images for all experiments. Our encoder-decoder pipeline is built upon the original VQGAN, with the only difference being that we replaced the original Group Normalization with the proposed spatially conditional normalization layer. The second-stage transformer uses 24 layers, 16 attention heads, 1024 embedding dimensions, and 4096 hidden dimensions, trained with a batch size of 64 across 4 Tesla V100 GPUs for 200 epochs. |
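
The Experiment Setup row states that the only architectural change relative to VQGAN is replacing Group Normalization in the decoder with a spatially conditional normalization layer modulated by the quantized codes. Below is a minimal PyTorch sketch of such a layer, in the spirit of SPADE-style modulation: the class name `SpatialCondNorm`, the 3×3 convolutions, and the nearest-neighbor resizing are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a spatially conditional normalization layer, assuming a
# SPADE-like design: group-normalize decoder features, then modulate them with
# per-pixel scale and shift predicted from the quantized code map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialCondNorm(nn.Module):
    def __init__(self, num_features: int, z_channels: int, num_groups: int = 32):
        super().__init__()
        # Parameter-free normalization; the affine part comes from the code map.
        self.norm = nn.GroupNorm(num_groups, num_features, affine=False)
        self.to_gamma = nn.Conv2d(z_channels, num_features, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(z_channels, num_features, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
        # Resize the quantized code map to the current feature resolution.
        z_q = F.interpolate(z_q, size=x.shape[-2:], mode="nearest")
        gamma = self.to_gamma(z_q)
        beta = self.to_beta(z_q)
        return self.norm(x) * (1 + gamma) + beta


# Usage sketch: 64-channel quantized codes (matching the paper's 64-dim codebook)
# modulating 256-channel decoder features; shapes are illustrative.
layer = SpatialCondNorm(num_features=256, z_channels=64)
x = torch.randn(1, 256, 32, 32)   # decoder features
z_q = torch.randn(1, 64, 16, 16)  # quantized code map
out = layer(x, z_q)               # same shape as x
```

Because the modulation is predicted per spatial location from the quantized codes rather than learned as a single global affine pair, it can restore layout-dependent detail that a plain Group Normalization layer would wash out; this is the intuition behind the "modulated VQGAN" described in the rows above.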