Extracting Training Data From Document-Based VQA Models
Authors: Francesco Pinto, Nathalie Rauschmayr, Florian Tramèr, Philip Torr, Federico Tombari
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantitatively measure the extractability of information in controlled experiments and differentiate between cases where it arises from generalization capabilities or from memorization. We further investigate the factors that influence memorization across multiple state-of-the-art models and propose an effective heuristic countermeasure that empirically prevents the extractability of PII. |
| Researcher Affiliation | Collaboration | ¹Department of Engineering Science, University of Oxford, Oxford, UK; ²Google, Zurich, Switzerland; ³ETH Zurich, Zurich, Switzerland. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not include an explicit statement about releasing its source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We focus on the DocVQA dataset (Mathew et al., 2021), which contains images of real-world documents with diverse formats (e.g., letters, advertisements, reports, tickets etc.). |
| Dataset Splits | Yes | To guard against overfitting, we perform early stopping based on the validation loss. |
| Hardware Specification | Yes | Fine-tuning Donut at maximum input resolution requires 64 A100 GPUs for a day. (...) Fine-tuning Pix2Struct Base, independently of the resolution, requires 32 TPUv2 for about 5 hours. Training Pix2Struct Large, independently of the resolution, requires 64 TPUv2 for about 5 hours. Fine-tuning PaLI-3 requires 64 TPUv2 for 15 hours. |
| Software Dependencies | No | The paper mentions 'Tesseract (Smith, 2007)' and 'PaLM 2 (Anil et al., 2023)' as tools used, but it does not specify version numbers for these or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | Each of the models is fine-tuned on DocVQA using the training procedure outlined by the respective authors. To guard against overfitting, we perform early stopping based on the validation loss. (...) We train each model multiple times with different image resolutions, to analyze the effect of this design choice on memorization. (...) we follow the sampling procedure in (Carlini et al., 2022) in order to produce K = 50 splits such that each canary is in or out of a split exactly 25 times. |
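The split-sampling scheme quoted in the last row (K = 50 splits, each canary included in exactly 25 of them) can be sketched as below. This is a minimal illustration of the balanced in/out assignment described by Carlini et al. (2022), not the authors' actual code; the function name and seed handling are assumptions.

```python
import random

def make_balanced_splits(canaries, k=50, in_count=25, seed=0):
    """Assign each canary to exactly `in_count` of `k` splits, chosen uniformly at random.

    Each canary is therefore 'in' a split 25 times and 'out' 25 times,
    as required for the membership-inference setup described above.
    """
    rng = random.Random(seed)
    splits = [set() for _ in range(k)]
    for canary in canaries:
        # pick, without replacement, the 25 splits that will contain this canary
        for i in rng.sample(range(k), in_count):
            splits[i].add(canary)
    return splits

canaries = [f"canary_{j}" for j in range(10)]
splits = make_balanced_splits(canaries)
# every canary appears in exactly 25 of the 50 splits
assert all(sum(c in s for s in splits) == 25 for c in canaries)
```

Per-split membership then serves as the ground truth when measuring whether a model trained on a given split memorizes its canaries.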