BendVLM: Test-Time Debiasing of Vision-Language Embeddings

Authors: Walter Gerych, Haoran Zhang, Kimia Hamidieh, Eileen Pan, Maanas K. Sharma, Tom Hartvigsen, Marzyeh Ghassemi

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We experimentally evaluate for classification, retrieval, and image captioning settings, showing BEND-VLM consistently outperforms the compared approaches. ... We compare our BEND-VLM to existing debiasing approaches on the FAIRFACE [18], CELEBA [26], and UTKFACE [57] datasets. ... We measure KL[P̂_a || P_a], the KL divergence between the attribute prior P_a (e.g. the true distribution of genders in the target dataset) and P̂_a, the empirical distribution of attribute labels for the set of images retrieved from the target dataset for a given query." (See the metric sketch below the table.)
Researcher Affiliation | Academia | Walter Gerych (MIT), Haoran Zhang (MIT), Kimia Hamidieh (MIT), Eileen Pan (MIT), Maanas Sharma (MIT), Thomas Hartvigsen (University of Virginia), Marzyeh Ghassemi (MIT)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/waltergerych/bend_vlm
Open Datasets | Yes | "We compare our BEND-VLM to existing debiasing approaches on the FAIRFACE [18], CELEBA [26], and UTKFACE [57] datasets."
Dataset Splits | Yes | "We do a 50/50 split of each dataset for the reference and target datasets. We additionally create 5 folds for the target dataset so that we can compute confidence intervals for all methods." (See the split sketch below the table.)
Hardware Specification | Yes | "We used a high-performance computing cluster for our experiments. All experiments can be reproduced in under a day with a single NVIDIA A100 (80GB VRAM) and 100GB of RAM."
Software Dependencies | Yes | "We use Mistral-7B-Instruct-v0.2 [17] for our AttributeAugment module."
Experiment Setup | Yes | "We chose n = 100 when selecting the n most relevant images for computing each D_ref(a_i, c) (see Section 3.2). We use the default value of λ = 1000 for Orth-Cal. and Orth-Proj.'s main hyperparameter. During retrieval, we always sample 500 images from the target dataset."
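
The KL-based retrieval bias metric quoted in the Research Type row is simple to compute. Below is a minimal Python sketch, not the authors' released code: the function and variable names are ours and the toy query result is invented; only the 500-image sample size and the idea of comparing the retrieved attribute distribution against a prior come from the quoted text.

```python
import numpy as np

def retrieval_bias_kl(retrieved_attrs, prior, eps=1e-12):
    """KL[P_hat_a || P_a]: divergence between the empirical attribute
    distribution over retrieved images (P_hat_a) and the attribute
    prior P_a, e.g. the true gender distribution in the target dataset."""
    attrs = np.asarray(retrieved_attrs)
    values = sorted(prior)  # fixed attribute order for both distributions
    p_hat = np.array([(attrs == v).mean() for v in values])
    p = np.array([prior[v] for v in values])
    # eps guards against log(0) when some group is never retrieved
    return float(np.sum(p_hat * np.log((p_hat + eps) / (p + eps))))

# Hypothetical query: 500 retrieved images (the sample size quoted above),
# 400 labeled "male" and 100 "female", scored against a 50/50 prior.
kl = retrieval_bias_kl(["male"] * 400 + ["female"] * 100,
                       prior={"male": 0.5, "female": 0.5})
print(f"KL = {kl:.4f}")  # ~0.1927; 0.0 would mean unbiased retrieval
```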
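
The split protocol in the Dataset Splits row can likewise be sketched in a few lines. This is our reading of the quoted description, not the released code; in particular, the assumption that the 5 folds are disjoint, equal-size cuts of the target half is ours.

```python
import numpy as np

def make_splits(n_images, n_folds=5, seed=0):
    """50/50 reference/target split; the target half is further cut into
    n_folds folds so per-fold metrics yield confidence intervals."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    reference, target = idx[: n_images // 2], idx[n_images // 2:]
    folds = np.array_split(target, n_folds)  # disjoint target folds
    return reference, folds

reference, folds = make_splits(n_images=20_000)
print(len(reference), [len(f) for f in folds])  # 10000 [2000, 2000, 2000, 2000, 2000]
```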