INViTE: INterpret and Control Vision-Language Models with Text Explanations

Authors: Haozhe Chen, Junfeng Yang, Carl Vondrick, Chengzhi Mao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 (Experiment), Dataset: VAW dataset (Pham et al., 2021) is a large-scale visual attributes dataset with bounding box labels for the attribute annotation. We use it to study whether the annotated attribute emerges in vision transformer reasoning. We evaluate over the validation set, which contains 3297 images.
Researcher Affiliation | Academia | Haozhe Chen (1), Junfeng Yang (1), Carl Vondrick (1), Chengzhi Mao (1,2,3); 1: Columbia University, 2: Mila, 3: McGill University
Pseudocode | No | The paper describes the architecture using mathematical equations (Eq. 1-5) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/tonychenxyz/vit-interpret.
Open Datasets | Yes | VAW dataset (Pham et al., 2021) is a large-scale visual attributes dataset with bounding box labels for the attribute annotation. UC Merced Land Use Dataset (Yang & Newsam, 2010) contains remote sensing satellite images of 21 classes, with 100 images in each class. CelebA dataset (Liu et al., 2015).
Dataset Splits | Yes | We evaluate over the validation set, which contains 3297 images.
Hardware Specification | Yes | We use a single Titan RTX GPU with 24GB memory for our experiment.
Software Dependencies | No | The paper mentions using the CLIP-B/32 model but does not specify version numbers for the programming languages or libraries (e.g., Python, PyTorch, CUDA) required for reproducibility.
Experiment Setup | Yes | We use a single Titan RTX GPU with 24GB memory for our experiment. We found that applying small random smoothing noise to the output of each layer when forward propagating without attention improves performance on model control tasks. We fine-tune the linear layer for one epoch with the Adam optimizer (lr = 10^-3).
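The reported setup (small random smoothing noise added to each layer's output during the attention-ablated forward pass, plus a linear layer fine-tuned for one epoch with Adam at lr = 10^-3) can be sketched as below. This is a minimal PyTorch sketch, not the authors' implementation: the noise scale, feature dimension, class count, stand-in encoder, and data loader are assumptions; only the optimizer, learning rate, and one-epoch schedule come from the paper's reported setup.

```python
# Minimal sketch of the reported experiment setup.
# Assumptions (not stated in the paper): noise_std = 1e-3, feat_dim = 512,
# num_classes = 21 (UC Merced has 21 classes), and a placeholder frozen encoder
# standing in for the CLIP-B/32 image encoder.
import torch
import torch.nn as nn

class NoisyLayerWrapper(nn.Module):
    """Wraps a transformer block and adds small Gaussian noise to its output,
    mimicking the smoothing applied when forward-propagating without attention."""
    def __init__(self, block, noise_std=1e-3):  # noise_std is an assumed value
        super().__init__()
        self.block = block
        self.noise_std = noise_std

    def forward(self, x, *args, **kwargs):
        out = self.block(x, *args, **kwargs)
        if isinstance(out, tuple):  # some blocks return (hidden_states, attn)
            return (out[0] + self.noise_std * torch.randn_like(out[0]),) + out[1:]
        return out + self.noise_std * torch.randn_like(out)

# Frozen feature extractor stands in for the CLIP-B/32 image encoder; in the
# real pipeline each encoder block would be replaced by NoisyLayerWrapper(block).
feat_dim, num_classes = 512, 21
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, feat_dim)).eval()
for p in encoder.parameters():
    p.requires_grad_(False)

# Linear layer fine-tuned for one epoch with Adam (lr = 1e-3), as reported.
probe = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    for images, labels in loader:          # images: (B, 3, 224, 224)
        with torch.no_grad():
            feats = encoder(images)        # frozen features
        loss = criterion(probe(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```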