INViTE: INterpret and Control Vision-Language Models with Text Explanations

Authors: Haozhe Chen, Junfeng Yang, Carl Vondrick, Chengzhi Mao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 (Experiment), Dataset: VAW dataset (Pham et al., 2021) is a large-scale visual attributes dataset with bounding box labels for the attribute annotation. We use it to study whether the annotated attribute emerges in vision transformer reasoning. We evaluate over the validation set, which contains 3297 images.
Researcher Affiliation | Academia | Haozhe Chen (1), Junfeng Yang (1), Carl Vondrick (1), Chengzhi Mao (1,2,3); 1: Columbia University, 2: Mila, 3: McGill University
Pseudocode | No | The paper describes the architecture using mathematical equations (Eq. 1-5) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/tonychenxyz/vit-interpret.
Open Datasets | Yes | VAW dataset (Pham et al., 2021) is a large-scale visual attributes dataset with bounding box labels for the attribute annotation. UC Merced Land Use Dataset (Yang & Newsam, 2010) contains remote sensing satellite images of 21 classes, with 100 images in each class. CelebA dataset (Liu et al., 2015).
Dataset Splits | Yes | We evaluate over the validation set, which contains 3297 images.
Hardware Specification | Yes | We use a single Titan RTX GPU with 24GB memory for our experiment.
Software Dependencies | No | The paper mentions using the CLIP-B/32 model but does not specify version numbers for the programming languages or libraries (e.g., Python, PyTorch, CUDA) required for reproducibility.
Experiment Setup | Yes | We use a single Titan RTX GPU with 24GB memory for our experiment. We found that applying small random smoothing noise to the output of each layer when forward propagating without attention improves performance on model control tasks. We fine-tune the linear layer for one epoch with the Adam optimizer (lr = 10^-3).
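The reported setup (small random smoothing noise added to each layer's output during the attention-ablated forward pass, plus a linear layer fine-tuned for one epoch with Adam at lr = 10^-3) can be sketched as below. This is a minimal PyTorch sketch, not the authors' implementation: the noise scale, feature dimension, class count, stand-in encoder, and data loader are assumptions; only the optimizer, learning rate, and one-epoch schedule come from the paper's reported setup.

```python
# Minimal sketch of the reported experiment setup.
# Assumptions (not stated in the paper): noise_std = 1e-3, feat_dim = 512,
# num_classes = 21 (UC Merced has 21 classes), and a placeholder frozen encoder
# standing in for the CLIP-B/32 image encoder.
import torch
import torch.nn as nn

class NoisyLayerWrapper(nn.Module):
    """Wraps a transformer block and adds small Gaussian noise to its output,
    mimicking the smoothing applied when forward-propagating without attention."""
    def __init__(self, block, noise_std=1e-3):  # noise_std is an assumed value
        super().__init__()
        self.block = block
        self.noise_std = noise_std

    def forward(self, x, *args, **kwargs):
        out = self.block(x, *args, **kwargs)
        if isinstance(out, tuple):  # some blocks return (hidden_states, attn)
            return (out[0] + self.noise_std * torch.randn_like(out[0]),) + out[1:]
        return out + self.noise_std * torch.randn_like(out)

# Frozen feature extractor stands in for the CLIP-B/32 image encoder; in the
# real pipeline each encoder block would be replaced by NoisyLayerWrapper(block).
feat_dim, num_classes = 512, 21
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, feat_dim)).eval()
for p in encoder.parameters():
    p.requires_grad_(False)

# Linear layer fine-tuned for one epoch with Adam (lr = 1e-3), as reported.
probe = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    for images, labels in loader:          # images: (B, 3, 224, 224)
        with torch.no_grad():
            feats = encoder(images)        # frozen features
        loss = criterion(probe(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```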