Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Auditing Test to Detect Behavioral Shift in Language Models

Authors: Leo Richter, Xuanli He, Pasquale Minervini, Matt Kusner

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach using two case studies: monitoring model changes in (a) toxicity and (b) translation performance. We find that the test is able to detect distribution changes in model behavior using hundreds of prompts.
Researcher Affiliation | Collaboration | 1 UCL Centre for Artificial Intelligence, University College London, United Kingdom; 2 School of Informatics, University of Edinburgh, United Kingdom; 3 Miniml.AI, United Kingdom; 4 Polytechnique Montréal, Canada; 5 Mila Quebec AI Institute, Canada. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Auditing Test
Open Source Code | Yes | We release our code here: https://github.com/richterleo/lm-auditing-test.
Open Datasets | Yes | We select prompts from the REALTOXICITYPROMPTS dataset (Gehman et al., 2020) and use the toxicity behavior function from Perspective API (Lees et al., 2022) to evaluate LM generations. ... We fine-tune these models on the BeaverTails dataset (Ji et al., 2024)... We instruction-tune Llama3 on 5 different task clusters from SUPERNATURALINSTRUCTIONS (Super NI; Mishra et al., 2022b; Wang et al., 2022).
Dataset Splits | No | The paper mentions fine-tuning on a "subset of 50K instances from the dataset" and using a "subset of English-French and English-Spanish samples drawn from tasks categorized as translation", totaling "67,975 prompts". It also describes evaluation runs with "2000 samples per fold, batch size 100" or "4000 samples per fold, batch size 100". These figures refer to the number of prompts used for fine-tuning and evaluation; they do not describe traditional train/validation/test splits for the datasets themselves.
Hardware Specification | Yes | All experiments were conducted on a single Nvidia A100 (80GB) GPU. ... accommodating the smaller memory of an Nvidia A100 (40GB).
Software Dependencies | No | The paper mentions using the AdamW optimizer and LoRA, but does not provide version numbers for any software libraries or dependencies (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | The training involves 512 steps, with a batch size of 64, utilizing the AdamW optimizer (Loshchilov & Hutter, 2018) with a learning rate of 2 × 10⁻⁴ and no weight decay. Due to computational constraints, we apply LoRA (Hu et al., 2021), with a rank of 16, to all models. ... The network is updated using gradient ascent, with a learning rate of 0.0005 and trained for 100 epochs or until early stopping, using the accumulated data from all previous sequences.
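For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a single place. The sketch below is a plain Python summary of the values reported in the paper; the dictionary layout and key names are illustrative and are not taken from the authors' released code.

```python
# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative (hypothetical), not the authors' code.

# Fine-tuning configuration (AdamW: Loshchilov & Hutter, 2018; LoRA: Hu et al., 2021)
FINETUNE_CONFIG = {
    "optimizer": "AdamW",
    "steps": 512,
    "batch_size": 64,
    "learning_rate": 2e-4,   # 2 × 10⁻⁴
    "weight_decay": 0.0,
    "lora_rank": 16,         # LoRA applied to all models
}

# Detection-network training (updated via gradient ascent on accumulated data)
DETECTOR_CONFIG = {
    "learning_rate": 5e-4,   # 0.0005
    "max_epochs": 100,       # or early stopping
}
```

Keeping these values in one configuration object makes it easy to check a reproduction run against the paper's reported setup.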