INViTE: INterpret and Control Vision-Language Models with Text Explanations
Authors: Haozhe Chen, Junfeng Yang, Carl Vondrick, Chengzhi Mao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENT Dataset. VAW dataset (Pham et al., 2021) is a large-scale visual attributes dataset with bounding box labels for the attribution annotation. We use it to study whether the annotated attribute emerges in vision transformer reasoning. We evaluate over the validation set, which contains 3297 images. |
| Researcher Affiliation | Academia | Haozhe Chen1, Junfeng Yang1, Carl Vondrick1, Chengzhi Mao1,2,3 (1Columbia University, 2Mila, 3McGill University) |
| Pseudocode | No | The paper describes the architecture using mathematical equations (Eq. 1-5) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/tonychenxyz/vit-interpret. |
| Open Datasets | Yes | VAW dataset (Pham et al., 2021) is a large-scale visual attributes dataset with bounding box labels for the attribution annotation. UC Merced Land Use Dataset (Yang & Newsam, 2010) contains remote sensing satellite images of 21 classes, with 100 images in each class. CelebA dataset (Liu et al., 2015). |
| Dataset Splits | Yes | We evaluate over the validation set, which contains 3297 images. |
| Hardware Specification | Yes | We use a single Titan RTX GPU with 24GB memory for our experiment. |
| Software Dependencies | No | The paper mentions using the 'CLIP-B/32 model' but does not specify version numbers for programming languages or libraries (e.g., Python, PyTorch, CUDA) required for reproducibility. |
| Experiment Setup | Yes | We use a single Titan RTX GPU with 24GB memory for our experiment. We found that applying small random smoothing noise to the output of each layer when forward-propagating without attentions improves performance on model control tasks. We fine-tune the linear layer for one epoch with the Adam optimizer (lr = 10^-3). |
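The experiment-setup row above can be sketched in code. This is a hypothetical illustration, not the authors' implementation: the feature dimension, class count, batch size, and noise scale `sigma` are assumptions, and the random tensors stand in for frozen CLIP-B/32 features. It shows the two stated choices: a one-epoch Adam fine-tune of a single linear layer at lr = 10^-3, and small random smoothing noise applied to intermediate outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, num_classes, n = 512, 21, 64          # assumed sizes (e.g. CLIP-B/32 features)
features = torch.randn(n, feat_dim)             # stand-in for frozen image features
labels = torch.randint(0, num_classes, (n,))

linear = nn.Linear(feat_dim, num_classes)       # the single linear layer being fine-tuned
opt = torch.optim.Adam(linear.parameters(), lr=1e-3)  # lr = 10^-3 as reported
loss_fn = nn.CrossEntropyLoss()

def smooth(x: torch.Tensor, sigma: float = 1e-3) -> torch.Tensor:
    # Small random smoothing noise on a layer's output; the paper reports
    # this helps model-control tasks. The scale sigma here is a guess.
    return x + sigma * torch.randn_like(x)

# One epoch over the (toy) dataset, batch size 8.
for x, y in zip(features.split(8), labels.split(8)):
    opt.zero_grad()
    loss = loss_fn(linear(smooth(x)), y)
    loss.backward()
    opt.step()
```

In the paper's actual pipeline the noise would be injected into each transformer layer's output during the attention-free forward pass, not just before the probe; the single `smooth` call here is only meant to show where such a perturbation sits relative to the fine-tuned layer.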