Teach LLMs to Phish: Stealing Private Information from Language Models

Authors: Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, Prateek Mittal

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1: Our new neural phishing attack has 3 phases, using standard setups for each. Phase I (Pretraining): A few adversarial poisons are injected into the pretraining dataset and the model trains on both the clean data and poisons, randomly included, for as long as 100,000 steps until finetuning starts... Figure 2: Random poisoning can extract secrets. The poisons are random sentences. 15% of the time we extract the full 12-digit number... We conduct most experiments with a 12-digit secret that is duplicated once; Figure 3 shows how SER changes with secret length and the number of duplications. (An illustrative sketch of this secret-insertion setup appears after the table.)
Researcher Affiliation | Collaboration | Ashwinee Panda (Princeton University), Christopher A. Choquette-Choo (Google DeepMind), Zhengming Zhang (Southeast University), Yaoqing Yang (Dartmouth College), Prateek Mittal (Princeton University)
Pseudocode | No | The paper describes the attack phases in text but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | We are not currently working on getting approval to release the code due to concerns over responsible disclosure.
Open Datasets | Yes | To this end, we use Enron Emails and Wikitext as our finetuning datasets. ... We then train for a varying number of steps on clean data on Wikitext (Merity et al., 2016)
Dataset Splits | No | The paper describes its evaluation methodology (e.g., 100 seeds, bootstrapped confidence intervals) and dataset usage, but does not explicitly provide training/validation/test dataset splits as percentages or counts for its experimental data. (A bootstrap confidence-interval sketch appears after the table.)
Hardware Specification | Yes | In Figure 4 we report the SER across three model sizes that can be trained on a single A100: 1.4b, 2.8b, 6.9b parameters.
Software Dependencies | No | The paper mentions the use of 'Huggingface Trainer' and 'Pythia family' models, but does not provide specific version numbers for these or other software dependencies such as Python or PyTorch.
Experiment Setup | Yes | All gradient updates use the AdamW optimizer with a learning rate of 5e-5, all other default optimizer parameters, and a batch size of 64. ... We use a 2.8b parameter model. ... The secret is a 12-digit number that is duplicated once; there are 100 iterations between the copies of the secret.
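
The sketch below shows a finetuning configuration consistent with the Experiment Setup row: AdamW at learning rate 5e-5, batch size 64, and a 2.8b-parameter Pythia model trained with the Huggingface Trainer on Wikitext. It is a minimal sketch, not the paper's released code (none is available); the checkpoint id, Wikitext config name, sequence length, step count, and output path are assumptions for illustration.

```python
# Hedged sketch of a finetuning run matching the quoted setup (AdamW, lr 5e-5,
# batch size 64, 2.8b Pythia model, Huggingface Trainer). Identifiers and step
# counts are illustrative assumptions, not taken from the paper.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "EleutherAI/pythia-2.8b"      # assumed 2.8b Pythia checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wikitext is one of the two finetuning corpora named in the paper.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

args = TrainingArguments(
    output_dir="neural-phishing-finetune",  # placeholder output path
    per_device_train_batch_size=64,         # batch size 64, as quoted
    learning_rate=5e-5,                     # AdamW with lr 5e-5; other defaults kept
    optim="adamw_torch",
    max_steps=1000,                         # illustrative; the paper varies step counts
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # Causal-LM collator copies input_ids into labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```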
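The Research Type row quotes the three-phase attack and the setting of a 12-digit secret that is duplicated once with 100 iterations between the copies. The sketch below shows one way such a secret could be inserted into a finetuning stream and how the secret extraction rate (SER) could be scored; the secret template, helper names, and example-level spacing are assumptions, not the paper's implementation.

```python
# Hedged sketch of secret insertion and SER scoring for the setup in Figures 1-3.
import random

def make_secret(num_digits: int = 12) -> str:
    """Sample a random numeric secret, e.g. a 12-digit account-style number."""
    return "".join(random.choice("0123456789") for _ in range(num_digits))

def insert_secret(clean_texts: list[str], secret: str, gap: int = 100) -> list[str]:
    """Insert two copies of a secret-bearing sentence into the finetuning stream.

    The paper duplicates the secret once with 100 training iterations between the
    copies; here the spacing is expressed in examples for simplicity.
    """
    secret_sentence = f"My credit card number is {secret}."  # assumed template
    poisoned = list(clean_texts)
    poisoned.insert(0, secret_sentence)
    poisoned.insert(min(gap, len(poisoned)), secret_sentence)
    return poisoned

def secret_extraction_rate(generations: list[str], secret: str) -> float:
    """SER: fraction of generations that reproduce the full digit string."""
    return sum(secret in g for g in generations) / len(generations)

# Toy usage: build a poisoned stream and score some dummy model outputs.
secret = make_secret()
stream = insert_secret([f"clean example {i}" for i in range(500)], secret)
print(secret_extraction_rate([secret, "no digits here", f"the number is {secret}"], secret))
```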
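The Dataset Splits row notes that results are reported over 100 seeds with bootstrapped confidence intervals. The sketch below computes a percentile-bootstrap confidence interval for the mean SER across per-seed results; the resample count and 95% level are illustrative choices, not values taken from the paper.

```python
# Hedged sketch of a bootstrapped confidence interval over per-seed SER values.
import numpy as np

def bootstrap_ci(per_seed_ser: np.ndarray, num_resamples: int = 10_000,
                 alpha: float = 0.05, rng_seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean secret extraction rate across seeds."""
    rng = np.random.default_rng(rng_seed)
    n = len(per_seed_ser)
    # Resample seeds with replacement and recompute the mean SER each time.
    resampled_means = np.array([
        per_seed_ser[rng.integers(0, n, size=n)].mean()
        for _ in range(num_resamples)
    ])
    lower = np.percentile(resampled_means, 100 * (alpha / 2))
    upper = np.percentile(resampled_means, 100 * (1 - alpha / 2))
    return float(lower), float(upper)

# Example: 0/1 extraction success for each of 100 seeds (values are synthetic).
per_seed = np.random.default_rng(1).binomial(1, 0.15, size=100).astype(float)
print(bootstrap_ci(per_seed))
```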