Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Authors: Hao Liu, Wilson Yan, Pieter Abbeel

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we first show that it is possible to train an auto-encoder that uses a language embedding space. Then, we carefully investigate the quality of the textual representations of images through three evaluations: (i) few-shot classification (Open-Ended MiniImageNet) (see Section 4.2); (ii) visual question answering (Fast VQA) (see Section 4.3); (iii) linear classification experiments (see Section 4.4). Our findings indicate that the textual semantics are effectively retained, allowing for strong performance on these tasks. Finally, our ablation study shows that using large language models (e.g., GPT-3 Davinci) improves results and that a high mask ratio is crucial for learning textual representations of images for text-image understanding.
Researcher Affiliation | Academia | Hao Liu, UC Berkeley, EMAIL; Wilson Yan, UC Berkeley, EMAIL; Pieter Abbeel, UC Berkeley, EMAIL
Pseudocode | No | The paper describes the model architecture and loss functions, but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/lhao499/language-quantized-autoencoders
Open Datasets | Yes | We train our LQAE on the ImageNet dataset, and use RoBERTa-base as our pretrained language denoising model.
Dataset Splits | No | No explicit train/validation split percentages or sample counts for the ImageNet dataset used to train LQAE are provided. The paper mentions that 'All of the images used come from the validation set of ImageNet' for the Mini-ImageNet evaluation, but this refers to the source of the evaluation data, not the training/validation split for LQAE.
Hardware Specification | Yes | Training takes 100 epochs with 5 warmup epochs. Batch size is 512 and training is distributed across 128 TPU-v3 on Google Cloud.
Software Dependencies | No | The paper mentions using the 'Adam [11] optimizer', 'RoBERTa-base', and 'GPT-3 or InstructGPT [2, 17]', and provides a link to the RoBERTa model. However, specific version numbers for these software dependencies or libraries are not provided.
Experiment Setup | Yes | The Adam [11] optimizer is used for training with peak learning rate 1.5 × 10⁻⁴ and weight decay 0.0005. Training takes 100 epochs with 5 warmup epochs. Batch size is 512 and training is distributed across 128 TPU-v3 on Google Cloud. ... We use ViT-Base [8] as image encoder and decoder.
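The reported schedule (peak learning rate 1.5 × 10⁻⁴, 100 epochs with 5 warmup epochs) can be sketched as a plain learning-rate function. The linear-warmup-then-cosine-decay shape is an assumption for illustration; the excerpt states only the peak rate and the epoch counts, not the decay curve.

```python
import math

# Values reported in the paper's setup; the decay shape below is an assumption.
PEAK_LR = 1.5e-4
WEIGHT_DECAY = 0.0005   # passed to the optimizer, not used in the schedule itself
TOTAL_EPOCHS = 100
WARMUP_EPOCHS = 5

def learning_rate(epoch: float) -> float:
    """Linear warmup to PEAK_LR over WARMUP_EPOCHS, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return PEAK_LR * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

For example, `learning_rate(0)` is 0, the schedule reaches the peak 1.5e-4 at epoch 5, and decays to 0 by epoch 100.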