Grey-box Extraction of Natural Language Models

Authors: Santiago Zanella-Béguelin, Shruti Tople, Andrew Paverd, Boris Köpf

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our attacks on LLMs of various sizes and architectures, fine-tuned to different downstream tasks. In particular, we study the effect on accuracy of the extracted model of using different kinds and amounts of API queries, and of using different learning rates for fine-tuning the encoder. Our key findings are: When the target model's base layers are frozen during fine-tuning (i.e., the attacker can get the exact embedding of any input), the algebraic attack is extremely effective. With only twice as many queries as the dimension of the embedding space (e.g., 1536 for BERT-base), we extract models that achieve 100% fidelity with the target, for all model sizes and tasks. (A sketch of this algebraic recovery appears below the table.)
Researcher Affiliation | Industry | ¹Microsoft Research, ²Microsoft Security Response Center. Correspondence to: Santiago Zanella-Béguelin <santiago@microsoft.com>.
Pseudocode | No | The paper describes attack steps in numbered lists (e.g., '1. Choose distinct inputs...'), but these are embedded in the text and are not presented as formal pseudocode or algorithm blocks with dedicated labels.
Open Source Code | No | The paper states 'Our core attack logic is simple and is implemented in only 20 lines of code with around 500 lines of boilerplate.' but does not provide any link or explicit statement about open-sourcing this code.
Open Datasets | Yes | We evaluate our algebraic extraction attacks on two text classification tasks from the GLUE benchmark: SST-2 (Socher et al., 2013) and MNLI (Williams et al., 2017). (A dataset-loading sketch appears below the table.)
Dataset Splits | Yes | We measure the success of the attack in terms of the replica's accuracy and agreement with the target model, both on the validation set of the task and on a different set of random challenge inputs. (A sketch of these two metrics appears below the table.)
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019) and the Hugging Face Transformers library (Wolf et al., 2020)' but does not provide specific version numbers for these software dependencies, which are required for reproducibility. (An illustrative pinning sketch appears below the table.)
Experiment Setup | Yes | We vary the learning rate (η) used to fine-tune the base layers from 0 to 2 × 10⁻⁵, while the classifier layer is always trained with a fixed learning rate of 2 × 10⁻⁵. All our models are fine-tuned for 3 epochs... For learning-based extraction, we fine-tune for 3 epochs the base model and any additional layers in the classification head using the AdamW optimizer with initial learning rate 3 × 10⁻⁵. (A configuration sketch appears below the table.)
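
The sketches below illustrate the techniques and settings the rows above refer to; none of them is the authors' code. First, the key finding in the Research Type row: when the encoder is frozen, the classification head is an affine map of embeddings the attacker can compute locally, so it can be recovered by solving one linear system. A minimal NumPy sketch, assuming the API returns raw logits; all names here are hypothetical:

```python
import numpy as np

# Hypothetical grey-box setting: the victim's head computes logits = E @ W + b,
# where E are embeddings from a frozen, publicly known encoder, so the
# attacker can compute E for any chosen input and observe the logits.
rng = np.random.default_rng(0)
d, k = 768, 3                      # embedding dim (BERT-base) and #classes
W_true = rng.normal(size=(d, k))
b_true = rng.normal(size=k)

def query_logits(E):
    """Stand-in for the victim API: logits for the given embeddings."""
    return E @ W_true + b_true

# Algebraic attack sketch: query d + 1 inputs with linearly independent
# embeddings and solve a linear system for the head's weights and bias.
E = rng.normal(size=(d + 1, d))            # embeddings of chosen inputs
Y = query_logits(E)                        # observed logits, shape (d+1, k)
A = np.hstack([E, np.ones((d + 1, 1))])    # augment with a bias column
theta, *_ = np.linalg.lstsq(A, Y, rcond=None)
W_hat, b_hat = theta[:-1], theta[-1]

assert np.allclose(W_hat, W_true, atol=1e-6)   # exact functional fidelity
assert np.allclose(b_hat, b_true, atol=1e-6)
```

Note that d + 1 independent queries are the information-theoretic minimum for an affine head; the 2×d budget quoted in the row refers to the paper's practical setting, which leaves slack for non-independent inputs.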
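
The Open Datasets row names SST-2 and MNLI from GLUE. One way to obtain them is via the Hugging Face datasets library (the library's standard API, not the paper's own tooling):

```python
from datasets import load_dataset

# GLUE tasks used by the paper, as hosted on the Hugging Face Hub.
sst2 = load_dataset("glue", "sst2")   # splits: train / validation / test
mnli = load_dataset("glue", "mnli")   # splits include validation_matched
                                      # and validation_mismatched

print(sst2["validation"].num_rows)            # SST-2 validation size (872)
print(mnli["validation_matched"].num_rows)    # MNLI matched validation (9815)
```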
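
The Dataset Splits row measures both replica accuracy and target/replica agreement, on the task's validation set and on unlabeled random challenge inputs. A sketch of those two metrics in PyTorch, assuming Hugging Face-style models whose outputs carry a `.logits` tensor:

```python
import torch

@torch.no_grad()
def accuracy_and_agreement(target, replica, loader):
    """Replica accuracy (where labels exist) and target/replica agreement.

    `loader` yields dicts of input tensors, optionally with a 'labels' key
    (absent for random challenge inputs, where only agreement is defined).
    """
    correct = agree = total = labeled = 0
    for batch in loader:
        labels = batch.pop("labels", None)
        t_pred = target(**batch).logits.argmax(dim=-1)
        r_pred = replica(**batch).logits.argmax(dim=-1)
        agree += (t_pred == r_pred).sum().item()
        total += t_pred.numel()
        if labels is not None:
            correct += (r_pred == labels).sum().item()
            labeled += labels.numel()
    accuracy = correct / labeled if labeled else float("nan")
    return accuracy, agree / total
```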
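
The Software Dependencies row flags the missing version pins. A reproducible setup would record them explicitly; the pins below are illustrative placeholders contemporary with ICML 2021, not versions stated by the authors:

```
# requirements.txt sketch. The paper names the libraries but not their
# versions; these pins are illustrative placeholders, NOT from the paper.
torch==1.7.1
transformers==4.2.2
datasets==1.2.1
```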
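
Finally, the Experiment Setup row fixes the optimizer and learning rates. A sketch of the two-rate regime using per-parameter-group learning rates in PyTorch; the quoted rates come from the row above, while the model name and training loop are placeholders:

```python
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

# Placeholder model for a binary task such as SST-2.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

base_lr = 2e-5   # varied from 0 to 2e-5 for the target's base layers;
                 # base_lr = 0 is the frozen-encoder regime in which the
                 # algebraic attack recovers the head exactly.
head_lr = 2e-5   # the classifier layer is always trained at 2e-5
optimizer = AdamW([
    {"params": model.bert.parameters(), "lr": base_lr},
    {"params": model.classifier.parameters(), "lr": head_lr},
])

# All models are fine-tuned for 3 epochs; for learning-based extraction the
# paper instead uses a single initial rate of 3e-5 for base model and head.
# for epoch in range(3):
#     for batch in train_loader:
#         loss = model(**batch).loss
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```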