Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset

Authors: Yingzi Ma, Jiongxiao Wang, Fei Wang, Siyuan Ma, Jiazhao Li, Jinsheng Pan, Xiujun Li, Furong Huang, Lichao Sun, Bo Li, Yejin Choi, Muhao Chen, Chaowei Xiao

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance, with significant trade-offs between model utility and forget quality. Furthermore, our findings highlight the importance of privacy attacks for robust evaluations. We hope FIUBench will drive progress in developing more effective VLM unlearning algorithms.
Researcher Affiliation | Academia | Yingzi Ma (1), Jiongxiao Wang (1), Fei Wang (2), Siyuan Ma (8), Jiazhao Li (3), Jinsheng Pan (9), Xiujun Li (4), Furong Huang (5), Lichao Sun (6), Bo Li (7), Yejin Choi (4), Muhao Chen (10), Chaowei Xiao (1); affiliations: (1) University of Wisconsin-Madison, (2) USC, (3) University of Michigan-Ann Arbor, (4) University of Washington, (5) University of Maryland, (6) Lehigh University, (7) UIUC, (8) Peking University, (9) University of Rochester, (10) University of California, Davis.
Pseudocode | No | The paper describes its methodology in prose and mathematical equations (Equations 1-4) but does not present any structured pseudocode or algorithm blocks. (An illustrative, non-authoritative sketch of two of the baseline unlearning objectives appears after this table.)
Open Source Code | Yes | FIUBench data is hosted at https://huggingface.co/datasets/gray311/FIUBench. New unlearning methods can be evaluated using the code at https://github.com/SaFoLab-WISC/FIUBench.
Open Datasets | Yes | FIUBench data is hosted at https://huggingface.co/datasets/gray311/FIUBench. New unlearning methods can be evaluated using the code at https://github.com/SaFoLab-WISC/FIUBench. ... All fictitious synthetic faces in our dataset are sourced from the SFHQ dataset (Beniaguev, 2022), which was created by turning faces from multiple sources (paintings, drawings, 3D models, text-to-image generators, etc.) into photorealistic images with StyleGAN2 (Karras et al., 2020). ... we randomly pair face images with health records and criminal histories sourced from (Patil, 2024) and (Mendes, 2020), respectively. For personal backgrounds, in addition to the information already present in the health records, such as names and birthdates, we collect addresses from Vyas (2017)... (A minimal loading sketch appears after this table.)
Dataset Splits | Yes | Before VLM unlearning, the dataset S is further divided into the forget set S_F for privacy forgetting and the retain set S_R for pretrained knowledge. Specifically, we by default select 5% of the facial identities, using all their corresponding QA pairs as the forget set, while the QA pairs of the remaining 95% serve as the retain set. ... Unlearning performance across different forget set splits: we follow previous work (Maini et al., 2024) to divide the benchmark into three splits: 1-99, 5-95, and 10-90. (An identity-level split sketch appears after this table.)
Hardware Specification | Yes | All experiments are conducted with A100 80GB GPUs for both Llama-3.2-Vision-11B and LLaVA-Phi-3-mini (3B) and set up with Python 3.10 and Ubuntu 22.04 on x86-64 CPUs.
Software Dependencies | Yes | All experiments are conducted with A100 80GB GPUs for both Llama-3.2-Vision-11B and LLaVA-Phi-3-mini (3B) and set up with Python 3.10 and Ubuntu 22.04 on x86-64 CPUs.
Experiment Setup | Yes | The hyperparameters we used are shown in Table 7. Table 7: Hyperparameter configurations of fine-tuning (stage 1) and unlearning (stage 2) on Llama-3.2-Vision-11B and LLaVA-Phi-3-mini (3B):

Hyperparameter     | Finetuning | Unlearning (GA / GD / KL / PO)
Cutoff Length      | 512        | 512
Learning Rate      | 2e-5       | 2e-5 / 2e-5 / 1e-4 / 3e-4
Optimizer          | AdamW      | AdamW
Batch Size         | 8          | 8
Accumulation Steps | 16         | 16
Dropout            | 0.05       | 0.05
# Epochs           | 10         | 8
LoRA Rank r        | 128        | 128
LoRA Alpha α       | 256        | 256
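
For convenience, here is a minimal sketch of loading the FIUBench data cited in the Open Datasets row, assuming the standard Hugging Face `datasets` API; the available configs, splits, and field names are not documented in this report, so inspect the loaded object before relying on them.

```python
from datasets import load_dataset

# Load FIUBench from the Hugging Face Hub (dataset id quoted above).
fiu = load_dataset("gray311/FIUBench")

# Inspect the splits and features first; the repository layout is an
# assumption here, not something this report documents.
print(fiu)
```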
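
The Dataset Splits row describes an identity-level 5-95 split in which a forgotten identity contributes all of its QA pairs to the forget set. A minimal sketch of that procedure, assuming each example carries a hypothetical `identity` field:

```python
import random

def split_by_identity(examples, forget_frac=0.05, seed=0):
    """Split examples into forget/retain sets at the identity level:
    forget_frac of identities contribute all of their QA pairs to the
    forget set; everything else goes to the retain set."""
    identities = sorted({ex["identity"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(identities)
    n_forget = max(1, int(len(identities) * forget_frac))
    forget_ids = set(identities[:n_forget])
    forget = [ex for ex in examples if ex["identity"] in forget_ids]
    retain = [ex for ex in examples if ex["identity"] not in forget_ids]
    return forget, retain

# 5-95 split by default; pass forget_frac=0.01 or 0.10 for the
# 1-99 and 10-90 splits mentioned above.
```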
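
The Pseudocode row notes that the four baselines are given only as equations. As orientation, here is a hedged sketch of two objectives that carry these names in the unlearning literature (Maini et al., 2024), Gradient Ascent (GA) and Gradient Difference (GD), assuming a Hugging Face-style model whose forward pass returns a `.loss`; this illustrates the method family, not the paper's exact Equations 1-4.

```python
def ga_loss(model, forget_batch):
    # Gradient Ascent: maximize the loss on the forget set by
    # minimizing its negation.
    return -model(**forget_batch).loss

def gd_loss(model, forget_batch, retain_batch):
    # Gradient Difference: ascend on the forget set while still
    # descending on the retain set to preserve model utility.
    return -model(**forget_batch).loss + model(**retain_batch).loss
```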
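
Finally, the LoRA settings in Table 7 map directly onto a PEFT-style configuration. A minimal sketch, assuming the `peft` library (which the report does not name) and its default target modules:

```python
from peft import LoraConfig, get_peft_model

# Values taken from Table 7; task_type is an assumption.
lora_cfg = LoraConfig(
    r=128,              # LoRA rank r
    lora_alpha=256,     # LoRA alpha
    lora_dropout=0.05,  # dropout
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, lora_cfg)
# Stage-1 fine-tuning then uses AdamW at lr=2e-5 with batch size 8 and
# 16 gradient-accumulation steps (Table 7), e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```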