Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability
Authors: Jorge García-Carrasco, Alejandro Maté, Juan Trujillo
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase our method on a pretrained GPT-2 Small model carrying out the task of predicting 3-letter acronyms to demonstrate its effectiveness in locating and understanding concrete vulnerabilities of the model. The experiments were performed with both the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries by using a 6GB RTX 3060 Laptop GPU. |
| Researcher Affiliation | Academia | Jorge García-Carrasco, Alejandro Maté and Juan Trujillo. Lucentia Research, Department of Software and Computing Systems, University of Alicante. jorge.g@ua.es, {amate, jtrujillo}@dlsi.ua.es |
| Pseudocode | Yes | Algorithm 1: Adversarial Sample Generation |
| Open Source Code | Yes | The code and data required to reproduce the experiments and figures, as well as the supplementary materials, can be found in https://github.com/jgcarrasco/detecting-vulnerabilities-mech-interp |
| Open Datasets | Yes | The first step consists of building a dataset that elicits the task or behavior of study. Hence, we built a dataset composed of three-letter acronyms. These acronyms were built by sampling from a public list of 91,000 nouns [Piscitelli, 2016]. (Footnote 1 refers to: Jordan Piscitelli. Simple wordlists. https://github.com/taikuukaits/SimpleWordlists, 2016.) A hedged sketch of this construction step appears below the table. |
| Dataset Splits | No | The paper mentions building a dataset for the task of study but does not provide specific train, validation, or test split percentages or counts for reproducibility. |
| Hardware Specification | Yes | The experiments were performed with both the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries by using a 6GB RTX 3060 Laptop GPU. |
| Software Dependencies | No | The paper mentions using 'PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries' but does not specify their version numbers. |
| Experiment Setup | Yes | In order to do so, we apply Algorithm 1, setting the mask m so that it only optimizes the third word of the initial samples. The vocabulary embedding E will be composed of every possible 1-token noun that we have in our dataset. Hence, the output of this algorithm will be an adversarial sample, i.e. an acronym whose third letter is misclassified by GPT-2 Small. We repeat this procedure several times with a batch size of 128 until we obtain 1000 adversarial samples. A hedged sketch of this search step appears below the table. |
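
The dataset-construction step quoted in the Open Datasets row is concrete enough to sketch. The snippet below is a hedged reconstruction, not the authors' released code: the word-list file path and the prompt template (ending just before the third acronym letter) are assumptions; only the overall recipe (sample three nouns from the Piscitelli list and take their initials) comes from the paper.

```python
import random

# Hedged sketch of the acronym-dataset construction. The file path and the
# prompt template are assumptions, not taken from the paper or its repo.

def load_nouns(path="nouns.txt"):  # hypothetical path to the Piscitelli list
    with open(path) as f:
        return [w.strip().capitalize() for w in f if w.strip()]

def make_sample(nouns, rng=random):
    """Sample three nouns and return (prompt, correct third letter)."""
    words = rng.sample(nouns, 3)
    acronym = "".join(w[0] for w in words)
    # Assumed template: the prompt stops right before the third letter, so
    # the model's next-token prediction is the letter being evaluated.
    prompt = f"The {words[0]} {words[1]} {words[2]} ({acronym[:2]}"
    return prompt, acronym[2]
```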
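
The Experiment Setup row describes running Algorithm 1 with a mask m that restricts optimization to the third word and a candidate vocabulary E of 1-token nouns. The authors' pseudocode is not reproduced here; the snippet below is a first-order (HotFlip-style) approximation of such a masked search, written against the Hugging Face GPT-2 API rather than TransformerLens for brevity. The loss, the one-hot gradient trick, and all helper names are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hedged, HotFlip-style sketch in the spirit of Algorithm 1 (not the
# authors' implementation). Only the third word's token may change,
# mirroring the mask m; candidates are restricted to 1-token nouns,
# mirroring the vocabulary embedding E.

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)  # gradients are only needed w.r.t. the input
emb = model.get_input_embeddings().weight        # (vocab_size, d_model)

def flip_third_word(token_ids, word_pos, target_id, cand_ids):
    """Return the candidate noun id whose swap-in at word_pos most
    increases the loss on the correct third letter (one greedy step)."""
    one_hot = torch.zeros(len(token_ids), emb.shape[0], dtype=emb.dtype)
    one_hot.scatter_(1, token_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    inputs_embeds = one_hot @ emb                # differentiable lookup
    logits = model(inputs_embeds=inputs_embeds.unsqueeze(0)).logits
    # Negative log-prob of the correct letter at the last position.
    loss = -torch.log_softmax(logits[0, -1], dim=-1)[target_id]
    loss.backward()
    grad = one_hot.grad[word_pos]                # (vocab_size,)
    # First-order estimate of the loss change for each candidate swap.
    scores = grad[cand_ids] - grad[token_ids[word_pos]]
    return cand_ids[scores.argmax()]

def is_adversarial(token_ids, target_id):
    """True if the model's top next-token prediction misses the letter."""
    with torch.no_grad():
        logits = model(token_ids.unsqueeze(0)).logits
    return logits[0, -1].argmax().item() != target_id
```

In a full run, one would repeat the substitution step over batches of initial samples (the quoted setup uses a batch size of 128) and keep only the misclassified acronyms until 1000 adversarial samples are collected.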