Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Auditing Test to Detect Behavioral Shift in Language Models

Authors: Leo Richter, Xuanli He, Pasquale Minervini, Matt Kusner

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach using two case studies: monitoring model changes in (a) toxicity and (b) translation performance. We find that the test is able to detect distribution changes in model behavior using hundreds of prompts.
Researcher Affiliation | Collaboration | 1 UCL Centre for Artificial Intelligence, University College London, United Kingdom; 2 School of Informatics, University of Edinburgh, United Kingdom; 3 Miniml.AI, United Kingdom; 4 Polytechnique Montréal, Canada; 5 Mila Quebec AI Institute, Canada. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Auditing Test
Open Source Code | Yes | We release our code here: https://github.com/richterleo/lm-auditing-test.
Open Datasets | Yes | We select prompts from the REALTOXICITYPROMPTS dataset (Gehman et al., 2020) and use the toxicity behavior function from Perspective API (Lees et al., 2022) to evaluate LM generations. ... We fine-tune these models on the BeaverTails dataset (Ji et al., 2024)... We instruction-tune Llama3 on 5 different task clusters from SUPERNATURALINSTRUCTIONS (Super NI; Mishra et al., 2022b; Wang et al., 2022).
Dataset Splits | No | The paper mentions fine-tuning on a "subset of 50K instances from the dataset" and using a "subset of English-French and English-Spanish samples drawn from tasks categorized as translation", totaling "67,975 prompts". It also describes evaluation runs with "2000 samples per fold, batch size 100" or "4000 samples per fold, batch size 100". These figures refer to the number of prompts used for fine-tuning and evaluation; they do not describe traditional train/validation/test splits for the datasets themselves.
Hardware Specification | Yes | All experiments were conducted on a single Nvidia A100 (80GB) GPU. ... accommodating the smaller memory of an Nvidia A100 (40GB).
Software Dependencies | No | The paper mentions using the AdamW optimizer and LoRA, but does not provide version numbers for any software libraries or dependencies (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | The training involves 512 steps, with a batch size of 64, utilizing the AdamW optimizer (Loshchilov & Hutter, 2018) with a learning rate of 2 × 10⁻⁴ and no weight decay. Due to computational constraints, we apply LoRA (Hu et al., 2021), with a rank of 16, to all models. ... The network is updated using gradient ascent, with a learning rate of 0.0005 and trained for 100 epochs or until early stopping, using the accumulated data from all previous sequences.
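For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a single place. The sketch below is a plain Python summary of the values reported in the paper; the dictionary layout and key names are illustrative and are not taken from the authors' released code.

```python
# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative (hypothetical), not the authors' code.

# Fine-tuning configuration (AdamW: Loshchilov & Hutter, 2018; LoRA: Hu et al., 2021)
FINETUNE_CONFIG = {
    "optimizer": "AdamW",
    "steps": 512,
    "batch_size": 64,
    "learning_rate": 2e-4,   # 2 × 10⁻⁴
    "weight_decay": 0.0,
    "lora_rank": 16,         # LoRA applied to all models
}

# Detection-network training (updated via gradient ascent on accumulated data)
DETECTOR_CONFIG = {
    "learning_rate": 5e-4,   # 0.0005
    "max_epochs": 100,       # or early stopping
}
```

Keeping these values in one configuration object makes it easy to check a reproduction run against the paper's reported setup.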