Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability

Authors: Jorge García-Carrasco, Alejandro Maté, Juan Trujillo

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase our method on a pretrained GPT-2 Small model carrying out the task of predicting 3-letter acronyms to demonstrate its effectiveness on locating and understanding concrete vulnerabilities of the model. The experiments were performed with both the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries by using a 6GB RTX 3060 Laptop GPU.
Researcher Affiliation | Academia | Jorge García-Carrasco, Alejandro Maté and Juan Trujillo, Lucentia Research, Department of Software and Computing Systems, University of Alicante. jorge.g@ua.es, {amate, jtrujillo}@dlsi.ua.es
Pseudocode | Yes | Algorithm 1: Adversarial Sample Generation
Open Source Code | Yes | The code and data required to reproduce the experiments and figures, as well as the supplementary materials, can be found in https://github.com/jgcarrasco/detecting-vulnerabilities-mech-interp
Open Datasets | Yes | The first step consists of building a dataset that elicits the task or behavior of study. Hence, we built a dataset composed of three-letter acronyms. These acronyms were built by sampling from a public list of 91000 nouns [Piscitelli, 2016]. (Footnote 1 refers to: Jordan Piscitelli. Simple wordlists. https://github.com/taikuukaits/SimpleWordlists, 2016.)
Dataset Splits | No | The paper mentions building a dataset for the task of study but does not provide specific train, validation, or test split percentages or counts for reproducibility.
Hardware Specification | Yes | The experiments were performed with both the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries by using a 6GB RTX 3060 Laptop GPU.
Software Dependencies | No | The paper mentions using the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries but does not specify their version numbers.
Experiment Setup | Yes | In order to do so, we apply Algorithm 1, setting the mask m so that it only optimizes the third word of the initial samples. The vocabulary embedding E will be composed of every possible 1-token noun that we have in our dataset. Hence, the output of this algorithm will be an adversarial sample, i.e. an acronym whose third letter is misclassified by GPT-2 Small. We repeat this procedure several times with a batch size of 128 until we obtain 1000 adversarial samples.
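
The Open Datasets row above describes building the dataset by sampling nouns from the SimpleWordlists noun list and forming three-letter acronyms. Below is a minimal sketch of that construction, assuming a local copy of the noun list; the file name and helper names (load_nouns, build_acronym_sample) are illustrative, not the authors' code.

```python
# Minimal sketch of the acronym-dataset construction described above:
# sample three nouns from the word list and form the 3-letter acronym.
# NOUNS_FILE and the helper names are assumptions for illustration only.
import random

NOUNS_FILE = "nouns.txt"  # assumed local copy of the SimpleWordlists noun list

def load_nouns(path: str) -> list[str]:
    """Read one noun per line, dropping blanks."""
    with open(path) as f:
        return [w.strip() for w in f if w.strip()]

def build_acronym_sample(nouns: list[str]) -> tuple[str, str]:
    """Pick three nouns and return (three-word phrase, its acronym)."""
    words = random.sample(nouns, 3)
    phrase = " ".join(w.capitalize() for w in words)
    acronym = "".join(w[0].upper() for w in words)
    return phrase, acronym

if __name__ == "__main__":
    nouns = load_nouns(NOUNS_FILE)
    phrase, acronym = build_acronym_sample(nouns)
    print(phrase, "->", acronym)  # e.g. "Quantum Logic Array -> QLA"
```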
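The Research Type and Hardware Specification rows report that the experiments run GPT-2 Small through PyTorch and TransformerLens on a 6GB RTX 3060 Laptop GPU. The following is a minimal sketch of that setup assuming current TransformerLens usage (the paper does not pin library versions); the acronym prompt format is an assumption and may differ from the paper's.

```python
# Sketch of the reported setup: GPT-2 Small via TransformerLens, on GPU if
# available. The prompt format below is an assumption for illustration.
import torch
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"  # paper: 6GB RTX 3060 Laptop GPU
model = HookedTransformer.from_pretrained("gpt2", device=device)  # GPT-2 Small

prompt = "The Quantum Logic Array (QL"          # illustrative acronym prompt
tokens = model.to_tokens(prompt)                # shape [1, seq], BOS prepended
with torch.no_grad():
    logits = model(tokens)                      # shape [1, seq, d_vocab]
pred_id = logits[0, -1].argmax().item()
print(model.tokenizer.decode([pred_id]))        # model's guess at the third letter
```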
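The Experiment Setup row applies the paper's Algorithm 1 with a mask that only optimizes the third word, a vocabulary restricted to 1-token nouns, a batch size of 128, and a target of 1000 adversarial samples. Algorithm 1 itself is not reproduced in this table, so the sketch below only mirrors that outer loop; a naive substitution search over the restricted vocabulary stands in for the paper's gradient-based inner step, and every helper name is hypothetical.

```python
# Outer-loop sketch: only the third word is varied (the mask m), candidates
# come from a restricted 1-token-noun vocabulary (the embedding E), and
# batches of 128 are processed until 1000 adversarial acronyms are collected.
# The inner search below is a naive stand-in for the paper's Algorithm 1.
import random
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Hypothetical pool of nouns that GPT-2 encodes as a single token; the paper
# filters its ~91k-noun list down to such words.
ONE_TOKEN_NOUNS = ["Logic", "Array", "Bridge", "Cloud", "Delta", "Engine", "Forest", "Garden"]

def third_letter_correct(first: str, second: str, third: str) -> bool:
    """Check whether GPT-2 Small predicts the acronym's third letter."""
    prompt = f"The {first} {second} {third} ({first[0]}{second[0]}"  # illustrative format
    with torch.no_grad():
        logits = model(model.to_tokens(prompt))
    pred = model.tokenizer.decode([logits[0, -1].argmax().item()]).strip()
    return pred.upper().startswith(third[0].upper())

def find_adversarial_third_word(first: str, second: str) -> str | None:
    """Naive stand-in for Algorithm 1: try third words from the restricted
    vocabulary until one makes the model misclassify the third letter."""
    for cand in random.sample(ONE_TOKEN_NOUNS, len(ONE_TOKEN_NOUNS)):
        if not third_letter_correct(first, second, cand):
            return cand
    return None

adversarial_samples = []
max_rounds = 500                                 # safety cap for this sketch
for _ in range(max_rounds):
    if len(adversarial_samples) >= 1000:         # target size from the paper
        break
    batch = [random.sample(ONE_TOKEN_NOUNS, 2) for _ in range(128)]  # batch size 128
    for first, second in batch:
        third = find_adversarial_third_word(first, second)
        if third is not None:
            adversarial_samples.append((first, second, third))
```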