LG-VQ: Language-Guided Codebook Learning

Authors: Liang Guotao, Baoquan Zhang, Yaowei Wang, Yunming Ye, Xutao Li, Wang Huaibin, Luo Chuyao, Kola Ye, Luo Linfeng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks. We conduct comprehensive experiments on four public datasets, which shows that our LG-VQ method outperforms various state-of-the-art models on reconstruction and various cross-modal tasks (e.g., text-to-image, image captioning, VQA).
Researcher Affiliation | Collaboration | 1 Harbin Institute of Technology, Shenzhen; 2 Peng Cheng Laboratory; 3 SiFar Company
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide our code and dataset information in the supplementary material.
Open Datasets | Yes | We evaluate our method on four public datasets, including TextCaps [39], CelebA-HQ [23], CUB-200 [45], and MS-COCO [20].
Dataset Splits | Yes | The RefCOCO dataset [49]... The train set has 42,404 expressions, the validation set has 3,811 expressions, and the test set has 3,785 expressions.
Hardware Specification | Yes | The semantic image synthesis experiments are conducted on 1 4090 GPU... The unconditional generation and image completion experiments are conducted on 2 4090 GPUs... The experiments are conducted on 1 4090 GPU... The experiments are conducted on 2 4090 GPUs...
Software Dependencies | No | The paper mentions using pre-trained models (e.g., CLIP [32], VQ-VAE [43], VQ-GAN [9]) and refers to external GitHub repositories (e.g., for VQ-GAN-Transformer, VQ-Diffusion) but does not provide specific version numbers for its own core software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Following VQ-GAN [9], all images are reshaped to 256×256 for reconstruction and generation. The down-sampling factor f is set to 16. The codebook size K is 1024. The batch size is 8. In our experiments, we maintain consistent parameter settings between our method LG-VQ and the chosen backbone networks (i.e., VQ-VAE [43], VQ-GAN [9], and CVQ [53]) for a fair comparison. Specifically, the vocabulary size, embedding number, and input sequence length are 1024, 1024, and 512, respectively. The layers and heads of the transformer are both 16. The diffusion step is 100. The training epoch is 90 for all models.
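
For readers who want to reproduce the reported setup, the quoted hyperparameters can be collected into a single configuration sketch. The snippet below is illustrative only: the dictionary and field names are assumptions, not the authors' actual config format, while the values restate the figures quoted in the Experiment Setup row (note that 256 / 16 = 16, so each image maps to a 16×16 grid of discrete codes).

```python
# Illustrative summary of the reported LG-VQ experiment settings.
# Field names are assumed for readability; the values come from the
# "Experiment Setup" row above, and the authors' actual config files may differ.
lgvq_experiment_config = {
    "image_size": 256,            # images reshaped to 256x256 (following VQ-GAN)
    "downsample_factor": 16,      # f = 16 -> a 16x16 grid of latent codes per image
    "codebook_size": 1024,        # K = 1024 codebook entries
    "batch_size": 8,
    "backbones": ["VQ-VAE", "VQ-GAN", "CVQ"],  # backbones compared under identical settings
    "transformer": {              # settings reported for the generation transformer
        "vocab_size": 1024,
        "embedding_num": 1024,
        "input_seq_len": 512,
        "layers": 16,
        "heads": 16,
    },
    "diffusion_steps": 100,       # diffusion step count reported for generation experiments
    "training_epochs": 90,        # 90 epochs for all models
}

# Quick consistency check: the latent grid size implied by image size and down-sampling factor.
latent_side = lgvq_experiment_config["image_size"] // lgvq_experiment_config["downsample_factor"]
assert latent_side == 16  # 16 x 16 = 256 discrete codes per image
```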