BendVLM: Test-Time Debiasing of Vision-Language Embeddings

Authors: Walter Gerych, Haoran Zhang, Kimia Hamidieh, Eileen Pan, Maanas K. Sharma, Tom Hartvigsen, Marzyeh Ghassemi

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We experimentally evaluate for classification, retrieval, and image captioning settings, showing BEND-VLM consistently outperforms the compared approaches. ... We compare our BEND-VLM to existing debiasing approaches on the FAIRFACE [18], CELEBA [26], and UTKFACE [57] datasets. ... We measure KL[P̂_a || P_a], the KL divergence between the attribute prior P_a (e.g. the true distribution of genders in the target dataset) and P̂_a, the empirical distribution of attribute labels for the set of images retrieved from the target dataset for a given query." (See the metric sketch below the table.)
Researcher Affiliation | Academia | Walter Gerych (MIT), Haoran Zhang (MIT), Kimia Hamidieh (MIT), Eileen Pan (MIT), Maanas Sharma (MIT), Thomas Hartvigsen (University of Virginia), Marzyeh Ghassemi (MIT)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/waltergerych/bend_vlm
Open Datasets | Yes | "We compare our BEND-VLM to existing debiasing approaches on the FAIRFACE [18], CELEBA [26], and UTKFACE [57] datasets."
Dataset Splits | Yes | "We do a 50/50 split of each dataset for the reference and target datasets. We additionally create 5 folds for the target dataset so that we can compute confidence intervals for all methods." (See the split sketch below the table.)
Hardware Specification | Yes | "We used a high-performance computing cluster for our experiments. All experiments can be reproduced in under a day with a single NVIDIA A100 (80GB VRAM) and 100GB of RAM."
Software Dependencies | Yes | "We use Mistral-7B-Instruct-v0.2 [17] for our AttributeAugment module."
Experiment Setup | Yes | "We chose n = 100 when selecting the n most relevant images for computing each D_ref(a_i, c) (see Section 3.2). We use the default value of λ = 1000 for Orth-Cal. and Orth-Proj.'s main hyperparameter. During retrieval, we always sample 500 images from the target dataset."
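
The KL-based retrieval bias metric quoted in the Research Type row is simple to compute. Below is a minimal Python sketch, not the authors' released code: the function and variable names are ours and the toy query result is invented; only the 500-image sample size and the idea of comparing the retrieved attribute distribution against a prior come from the quoted text.

```python
import numpy as np

def retrieval_bias_kl(retrieved_attrs, prior, eps=1e-12):
    """KL[P_hat_a || P_a]: divergence between the empirical attribute
    distribution over retrieved images (P_hat_a) and the attribute
    prior P_a, e.g. the true gender distribution in the target dataset."""
    attrs = np.asarray(retrieved_attrs)
    values = sorted(prior)  # fixed attribute order for both distributions
    p_hat = np.array([(attrs == v).mean() for v in values])
    p = np.array([prior[v] for v in values])
    # eps guards against log(0) when some group is never retrieved
    return float(np.sum(p_hat * np.log((p_hat + eps) / (p + eps))))

# Hypothetical query: 500 retrieved images (the sample size quoted above),
# 400 labeled "male" and 100 "female", scored against a 50/50 prior.
kl = retrieval_bias_kl(["male"] * 400 + ["female"] * 100,
                       prior={"male": 0.5, "female": 0.5})
print(f"KL = {kl:.4f}")  # ~0.1927; 0.0 would mean unbiased retrieval
```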
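
The split protocol in the Dataset Splits row can likewise be sketched in a few lines. This is our reading of the quoted description, not the released code; in particular, the assumption that the 5 folds are disjoint, equal-size cuts of the target half is ours.

```python
import numpy as np

def make_splits(n_images, n_folds=5, seed=0):
    """50/50 reference/target split; the target half is further cut into
    n_folds folds so per-fold metrics yield confidence intervals."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    reference, target = idx[: n_images // 2], idx[n_images // 2:]
    folds = np.array_split(target, n_folds)  # disjoint target folds
    return reference, folds

reference, folds = make_splits(n_images=20_000)
print(len(reference), [len(f) for f in folds])  # 10000 [2000, 2000, 2000, 2000, 2000]
```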