LG-VQ: Language-Guided Codebook Learning

Authors: Liang Guotao, Baoquan Zhang, Yaowei Wang, Yunming Ye, Xutao Li, Wang Huaibin, Luo Chuyao, Kola Ye, Luo Linfeng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks. We conduct comprehensive experiments on four public datasets, which shows that our LG-VQ method outperforms various state-of-the-art models on reconstruction and various cross-modal tasks (e.g., text-to-image, image captioning, VQA).
Researcher Affiliation | Collaboration | 1 Harbin Institute of Technology, Shenzhen; 2 Peng Cheng Laboratory; 3 SiFar Company
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide our code and dataset information in the supplementary material.
Open Datasets | Yes | We evaluate our method on four public datasets, including TextCaps [39], CelebA-HQ [23], CUB-200 [45], and MS-COCO [20].
Dataset Splits | Yes | The RefCOCO dataset [49]... The train set has 42,404 expressions, the validation set has 3,811 expressions, and the test set has 3,785 expressions.
Hardware Specification | Yes | The semantic image synthesis experiments are conducted on 1 4090 GPU... The unconditional generation and image completion experiments are conducted on 2 4090 GPUs... The experiments are conducted on 1 4090 GPU... The experiments are conducted on 2 4090 GPUs...
Software Dependencies | No | The paper mentions using pre-trained models (e.g., CLIP [32], VQ-VAE [43], VQ-GAN [9]) and refers to external GitHub repositories (e.g., for VQ-GAN-Transformer, VQ-Diffusion) but does not provide specific version numbers for its own core software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Following VQ-GAN [9], all images are reshaped to 256×256 for reconstruction and generation. The down-sampling factor f is set to 16. The codebook size K is 1024. The batch size is 8. In our experiments, we maintain consistent parameter settings between our method LG-VQ and the chosen backbone networks (i.e., VQ-VAE [43], VQ-GAN [9], and CVQ [53]) for a fair comparison. Specifically, the vocabulary size, embedding number, and input sequence length are 1024, 1024, and 512, respectively. The layers and heads of the transformer are both 16. The diffusion step is 100. The training epoch is 90 for all models.
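
For readers who want to reproduce the reported setup, the quoted hyperparameters can be collected into a single configuration sketch. The snippet below is illustrative only: the dictionary and field names are assumptions, not the authors' actual config format, while the values restate the figures quoted in the Experiment Setup row (note that 256 / 16 = 16, so each image maps to a 16×16 grid of discrete codes).

```python
# Illustrative summary of the reported LG-VQ experiment settings.
# Field names are assumed for readability; the values come from the
# "Experiment Setup" row above, and the authors' actual config files may differ.
lgvq_experiment_config = {
    "image_size": 256,            # images reshaped to 256x256 (following VQ-GAN)
    "downsample_factor": 16,      # f = 16 -> a 16x16 grid of latent codes per image
    "codebook_size": 1024,        # K = 1024 codebook entries
    "batch_size": 8,
    "backbones": ["VQ-VAE", "VQ-GAN", "CVQ"],  # backbones compared under identical settings
    "transformer": {              # settings reported for the generation transformer
        "vocab_size": 1024,
        "embedding_num": 1024,
        "input_seq_len": 512,
        "layers": 16,
        "heads": 16,
    },
    "diffusion_steps": 100,       # diffusion step count reported for generation experiments
    "training_epochs": 90,        # 90 epochs for all models
}

# Quick consistency check: the latent grid size implied by image size and down-sampling factor.
latent_side = lgvq_experiment_config["image_size"] // lgvq_experiment_config["downsample_factor"]
assert latent_side == 16  # 16 x 16 = 256 discrete codes per image
```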