LG-VQ: Language-Guided Codebook Learning
Authors: Guotao Liang, Baoquan Zhang, Yaowei Wang, Yunming Ye, Xutao Li, Huaibin Wang, Chuyao Luo, Kola Ye, Linfeng Luo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks. We conduct comprehensive experiments on four public datasets, which show that our LG-VQ method outperforms various state-of-the-art models on reconstruction and cross-modal tasks (e.g., text-to-image, image captioning, VQA). |
| Researcher Affiliation | Collaboration | Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Si Far Company |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide our code and dataset information in the supplementary material. |
| Open Datasets | Yes | We evaluate our method on four public datasets, including TextCaps [39], CelebA-HQ [23], CUB-200 [45], and MS-COCO [20]. |
| Dataset Splits | Yes | The RefCOCO dataset [49]... The train set has 42,404 expressions, the validation set has 3,811 expressions, and the test set has 3,785 expressions. |
| Hardware Specification | Yes | Experiments are run on NVIDIA 4090 GPUs: the semantic image synthesis experiments are conducted on 1 GPU, the unconditional generation and image completion experiments on 2 GPUs, and the remaining experiments on 1 or 2 GPUs. |
| Software Dependencies | No | The paper mentions using pre-trained models (e.g., CLIP [32], VQ-VAE [43], VQ-GAN [9]) and refers to external GitHub repositories (e.g., for VQ-GAN-Transformer, VQ-Diffusion) but does not provide specific version numbers for its own core software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Following VQ-GAN [9], all images are resized to 256×256 for reconstruction and generation. The down-sampling factor f is set to 16. The codebook size K is 1024. The batch size is 8. In our experiments, we maintain consistent parameter settings between our method LG-VQ and the chosen backbone networks (i.e., VQ-VAE [43], VQ-GAN [9], and CVQ [53]) for a fair comparison. Specifically, the vocabulary size, embedding number, and input sequence length are 1024, 1024, and 512, respectively. The transformer has 16 layers and 16 heads. The diffusion step is 100. The training epoch is 90 for all models. (A quantizer config sketch matching these hyperparameters follows the table.) |
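
To make the quoted setup concrete, below is a minimal PyTorch sketch of a standard VQ-VAE/VQ-GAN-style nearest-neighbour quantizer using the stated hyperparameters (codebook size K = 1024, 256×256 images, down-sampling factor f = 16, batch size 8). This is an illustrative sketch, not the authors' released code: the class name, the embedding dimension of 256, and the commitment weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup as in VQ-VAE/VQ-GAN backbones.

    Hyperparameters follow the quoted setup: codebook size K = 1024,
    images 256x256 with down-sampling factor f = 16, i.e. a 16x16 grid
    of codes per image. The embedding dim (256) is illustrative.
    """

    def __init__(self, num_codes: int = 1024, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z):
        # z: (B, C, H, W) continuous encoder output, e.g. (8, 256, 16, 16)
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)       # (B*H*W, C)
        dist = torch.cdist(flat, self.codebook.weight)    # L2 distance to every code
        idx = dist.argmin(dim=1)                          # nearest code per position
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)

        # codebook + commitment losses, straight-through estimator for the decoder
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(B, H, W), loss


# Usage: a 256x256 image down-sampled by f = 16 gives a 16x16 latent grid.
quantizer = VectorQuantizer(num_codes=1024, dim=256)
z = torch.randn(8, 256, 16, 16)        # batch size 8, as in the quoted setup
z_q, codes, vq_loss = quantizer(z)
print(z_q.shape, codes.shape, vq_loss.item())
```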