Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Authors: Hao Liu, Wilson Yan, Pieter Abbeel

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we first show that it is possible to train an auto-encoder that uses a language embedding space. Then, we carefully investigate the quality of the textual representations of images through three evaluations: (i) few-shot classification (Open-Ended miniImageNet) (see Section 4.2); (ii) visual question answering (Fast VQA) (see Section 4.3); (iii) linear classification experiments (see Section 4.4). Our findings indicate that the textual semantics are effectively retained, allowing for strong performance on these tasks. Finally, our ablation study shows that using large language models (e.g., GPT-3 Davinci) improves results and that a high mask ratio is crucial for learning textual representations of images for text-image understanding. (A minimal sketch of the language-codebook quantization step this row describes is given after the table.)
Researcher Affiliation | Academia | Hao Liu (UC Berkeley, hao.liu@cs.berkeley.edu); Wilson Yan (UC Berkeley, wilson1.yan@berkeley.edu); Pieter Abbeel (UC Berkeley, pabbeel@cs.berkeley.edu)
Pseudocode | No | The paper describes the model architecture and loss functions, but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/lhao499/language-quantized-autoencoders
Open Datasets | Yes | We train our LQAE on the ImageNet dataset, and use RoBERTa-base as our pretrained language denoising model.
Dataset Splits | No | No explicit train/validation split percentages or sample counts are provided for the ImageNet dataset used to train LQAE. The paper states that 'All of the images used come from the validation set of ImageNet', but this refers to the source of the Mini-ImageNet evaluation data, not to a training/validation split for LQAE.
Hardware Specification | Yes | Training takes 100 epochs with 5 warmup epochs. Batch size is 512 and training is distributed between 128 TPU-v3 on Google Cloud.
Software Dependencies | No | The paper mentions using the 'Adam [11] optimizer', 'RoBERTa-base', and 'GPT-3 or InstructGPT [2, 17]', and provides a link to the RoBERTa model. However, specific version numbers for these software dependencies or libraries are not provided.
Experiment Setup | Yes | Adam [11] optimizer is used for training with peak learning rate 1.5 × 10^-4 and weight decay 0.0005. Training takes 100 epochs with 5 warmup epochs. Batch size is 512 and training is distributed between 128 TPU-v3 on Google Cloud. ... We use ViT-base [8] as image encoder and decoder. (A hedged sketch of this optimization setup is given after the table.)
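
The core operation the Research Type row describes, mapping encoded image patches into a pretrained language model's embedding space, amounts to vector quantization against a frozen token-embedding table. The following is a minimal JAX sketch of that step, not the authors' implementation: the function name, the array shapes, and the use of a straight-through estimator (standard VQ-VAE practice) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of quantizing image features to the
# nearest entries of a frozen language-model token-embedding table.
# `features`/`codebook` shapes and the function name are illustrative.
import jax
import jax.numpy as jnp

def quantize_to_tokens(features, codebook):
    """features: (N, D) encoded image patches; codebook: (V, D) frozen token
    embeddings, e.g. RoBERTa's input embedding matrix."""
    # Squared L2 distance from every feature to every token embedding.
    d2 = (jnp.sum(features ** 2, axis=1, keepdims=True)
          - 2.0 * features @ codebook.T
          + jnp.sum(codebook ** 2, axis=1))
    token_ids = jnp.argmin(d2, axis=1)        # (N,) discrete "text" codes
    quantized = codebook[token_ids]           # (N, D) snapped features
    # Straight-through estimator: gradients reach the encoder as if the
    # quantization were the identity (standard VQ-VAE practice, assumed here).
    quantized = features + jax.lax.stop_gradient(quantized - features)
    return token_ids, quantized
```

The resulting token ids are what a frozen language denoiser such as RoBERTa can consume; the mask ratio discussed in the ablation would be applied to these ids before denoising, while the quantized features feed the image decoder.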
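
For the Experiment Setup row, the reported hyperparameters (Adam with peak learning rate 1.5 × 10^-4, weight decay 0.0005, 100 epochs with 5 warmup epochs, batch size 512) can be written as an Optax configuration. This is a hedged sketch only: the decoupled weight decay (adamw), the cosine decay after warmup, and the steps_per_epoch value derived from the ImageNet-1k training-set size are assumptions not confirmed by the excerpt.

```python
# Hedged sketch of the reported optimization setup; cosine decay shape,
# decoupled weight decay, and steps_per_epoch are assumptions.
import optax

batch_size = 512
steps_per_epoch = 1_281_167 // batch_size   # assumed: ImageNet-1k train size / batch

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1.5e-4,                  # peak learning rate from the paper
    warmup_steps=5 * steps_per_epoch,   # 5 warmup epochs
    decay_steps=100 * steps_per_epoch,  # 100 training epochs in total
    end_value=0.0,
)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.0005)
```

The schedule and optimizer would then be passed to whatever training loop drives the ViT-base encoder/decoder; only the numeric hyperparameters above are taken from the paper.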