Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability

Authors: Jorge García-Carrasco, Alejandro Maté, Juan Trujillo

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase our method on a pretrained GPT-2 Small model carrying out the task of predicting 3-letter acronyms to demonstrate its effectiveness on locating and understanding concrete vulnerabilities of the model. The experiments were performed with both the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries by using a 6GB RTX 3060 Laptop GPU.
Researcher Affiliation | Academia | Jorge García-Carrasco, Alejandro Maté and Juan Trujillo, Lucentia Research, Department of Software and Computing Systems, University of Alicante. jorge.g@ua.es, {amate, jtrujillo}@dlsi.ua.es
Pseudocode | Yes | Algorithm 1: Adversarial Sample Generation
Open Source Code | Yes | The code and data required to reproduce the experiments and figures, as well as the supplementary materials, can be found in https://github.com/jgcarrasco/detecting-vulnerabilities-mech-interp
Open Datasets | Yes | The first step consists of building a dataset that elicits the task or behavior of study. Hence, we built a dataset composed of three-letter acronyms. These acronyms were built by sampling from a public list of 91000 nouns [Piscitelli, 2016]. (Footnote 1 refers to: Jordan Piscitelli. Simple wordlists. https://github.com/taikuukaits/SimpleWordlists, 2016.)
Dataset Splits | No | The paper mentions building a dataset for the task of study but does not provide specific train, validation, or test split percentages or counts for reproducibility.
Hardware Specification | Yes | The experiments were performed with both the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries by using a 6GB RTX 3060 Laptop GPU.
Software Dependencies | No | The paper mentions using the PyTorch [Paszke et al., 2019] and TransformerLens [Nanda and Bloom, 2022] libraries but does not specify their version numbers.
Experiment Setup | Yes | In order to do so, we apply Algorithm 1, setting the mask m so that it only optimizes the third word of the initial samples. The vocabulary embedding E will be composed of every possible 1-token noun that we have in our dataset. Hence, the output of this algorithm will be an adversarial sample, i.e. an acronym whose third letter is misclassified by GPT-2 Small. We repeat this procedure several times with a batch size of 128 until we obtain 1000 adversarial samples.
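
The Open Datasets row above describes building the dataset by sampling nouns from the SimpleWordlists noun list and forming three-letter acronyms. Below is a minimal sketch of that construction, assuming a local copy of the noun list; the file name and helper names (load_nouns, build_acronym_sample) are illustrative, not the authors' code.

```python
# Minimal sketch of the acronym-dataset construction described above:
# sample three nouns from the word list and form the 3-letter acronym.
# NOUNS_FILE and the helper names are assumptions for illustration only.
import random

NOUNS_FILE = "nouns.txt"  # assumed local copy of the SimpleWordlists noun list

def load_nouns(path: str) -> list[str]:
    """Read one noun per line, dropping blanks."""
    with open(path) as f:
        return [w.strip() for w in f if w.strip()]

def build_acronym_sample(nouns: list[str]) -> tuple[str, str]:
    """Pick three nouns and return (three-word phrase, its acronym)."""
    words = random.sample(nouns, 3)
    phrase = " ".join(w.capitalize() for w in words)
    acronym = "".join(w[0].upper() for w in words)
    return phrase, acronym

if __name__ == "__main__":
    nouns = load_nouns(NOUNS_FILE)
    phrase, acronym = build_acronym_sample(nouns)
    print(phrase, "->", acronym)  # e.g. "Quantum Logic Array -> QLA"
```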
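The Research Type and Hardware Specification rows report that the experiments run GPT-2 Small through PyTorch and TransformerLens on a 6GB RTX 3060 Laptop GPU. The following is a minimal sketch of that setup assuming current TransformerLens usage (the paper does not pin library versions); the acronym prompt format is an assumption and may differ from the paper's.

```python
# Sketch of the reported setup: GPT-2 Small via TransformerLens, on GPU if
# available. The prompt format below is an assumption for illustration.
import torch
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"  # paper: 6GB RTX 3060 Laptop GPU
model = HookedTransformer.from_pretrained("gpt2", device=device)  # GPT-2 Small

prompt = "The Quantum Logic Array (QL"          # illustrative acronym prompt
tokens = model.to_tokens(prompt)                # shape [1, seq], BOS prepended
with torch.no_grad():
    logits = model(tokens)                      # shape [1, seq, d_vocab]
pred_id = logits[0, -1].argmax().item()
print(model.tokenizer.decode([pred_id]))        # model's guess at the third letter
```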
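The Experiment Setup row applies the paper's Algorithm 1 with a mask that only optimizes the third word, a vocabulary restricted to 1-token nouns, a batch size of 128, and a target of 1000 adversarial samples. Algorithm 1 itself is not reproduced in this table, so the sketch below only mirrors that outer loop; a naive substitution search over the restricted vocabulary stands in for the paper's gradient-based inner step, and every helper name is hypothetical.

```python
# Outer-loop sketch: only the third word is varied (the mask m), candidates
# come from a restricted 1-token-noun vocabulary (the embedding E), and
# batches of 128 are processed until 1000 adversarial acronyms are collected.
# The inner search below is a naive stand-in for the paper's Algorithm 1.
import random
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Hypothetical pool of nouns that GPT-2 encodes as a single token; the paper
# filters its ~91k-noun list down to such words.
ONE_TOKEN_NOUNS = ["Logic", "Array", "Bridge", "Cloud", "Delta", "Engine", "Forest", "Garden"]

def third_letter_correct(first: str, second: str, third: str) -> bool:
    """Check whether GPT-2 Small predicts the acronym's third letter."""
    prompt = f"The {first} {second} {third} ({first[0]}{second[0]}"  # illustrative format
    with torch.no_grad():
        logits = model(model.to_tokens(prompt))
    pred = model.tokenizer.decode([logits[0, -1].argmax().item()]).strip()
    return pred.upper().startswith(third[0].upper())

def find_adversarial_third_word(first: str, second: str) -> str | None:
    """Naive stand-in for Algorithm 1: try third words from the restricted
    vocabulary until one makes the model misclassify the third letter."""
    for cand in random.sample(ONE_TOKEN_NOUNS, len(ONE_TOKEN_NOUNS)):
        if not third_letter_correct(first, second, cand):
            return cand
    return None

adversarial_samples = []
max_rounds = 500                                 # safety cap for this sketch
for _ in range(max_rounds):
    if len(adversarial_samples) >= 1000:         # target size from the paper
        break
    batch = [random.sample(ONE_TOKEN_NOUNS, 2) for _ in range(128)]  # batch size 128
    for first, second in batch:
        third = find_adversarial_third_word(first, second)
        if third is not None:
            adversarial_samples.append((first, second, third))
```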