Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Permissioned LLMs: Enforcing Access Control in Large Language Models

Authors: Bargav Jayaraman, Virendra Marathe, Hamid Mozaffari, William Shen, Krishnaram Kenthapadi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the efficacy of our PermLLM mechanisms through extensive experiments on five public datasets (GPQA, RCV1, SimpleQA, WMDP, and PubMedQA), in addition to evaluating the validity of DDI and UGI metrics themselves for quantifying access control in LLMs.
Researcher Affiliation	Collaboration	Bargav Jayaraman (Oracle Labs); Virendra J. Marathe (Oracle Labs); Hamid Mozaffari (Oracle Labs); William F. Shen (University of Cambridge); Krishnaram Kenthapadi (Oracle Health)
Pseudocode	No	The paper describes mechanisms ('Activate', 'Merge', 'Union') and their operational details, but it does not present them in a structured pseudocode or algorithm block format with explicit labels like 'Pseudocode' or 'Algorithm'.
Open Source Code	No	Due to our organizational policies regarding intellectual property, we can not currently open-source the code and the fine-tuned models.
Open Datasets	Yes	We demonstrate the efficacy of our PermLLM mechanisms through extensive experiments on five public datasets (GPQA, RCV1, SimpleQA, WMDP, and PubMedQA)...
Dataset Splits	Yes	We do 4:1 split of the data set to obtain training and test sets. The training set consists of 2936 question answer pairs... The test set size is 732 records... We do 4:1 split of the data set to obtain training and test sets. The training set consists of 360 question answer pairs... The test set size is 88 records... We do 4:1 split of the data set to obtain training and test sets. The training set consists of 4089 question answer pairs... The test set size is 1018 records... We then did 2:1 split of the subset to obtain training and test sets. The training set consists of 45622 question answer pairs... The test set size is 22811 records.
Hardware Specification	Yes	For all our experiments, we use 8 H100 GPUs (with 80GB VRAM per GPU), 4 workers per GPU, and 384 GB RAM.
Software Dependencies	No	The paper mentions 'Llama-3.1-8B[15]' and 'Mistral-0.1-7B[19]' as pretrained models, and refers to using 'AdamW optimizer' and 'LoRA PEFT adapter [17]'. However, it does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or the PEFT library itself.
Experiment Setup	Yes	For all the LoRA adapters, we use 64 rank and 0.1 dropout. We use AdamW optimizer with 0.1 weight decay to fine-tuned all the models for 5 epochs with 300 warmup steps, 2 batch size and 5 × 10^-4 learning rate (except for Mistral-0.1-7B full fine-tuning that uses a learning rate of 5 × 10^-5).
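
For readers who want the figures from the Dataset Splits and Experiment Setup rows in machine-readable form, the following is a minimal sketch, not the authors' code (which the paper says is not released). The class name, field names, and helper function are hypothetical. The learning rates are taken as 5e-4 and 5e-5, which is an interpretation of the extracted values. Settings the rows do not state (LoRA alpha, target modules, scheduler, library versions) are deliberately left out. The split sizes are listed in the order they are quoted; the row does not say which dataset each pair belongs to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical field names)."""
    lora_rank: int = 64
    lora_dropout: float = 0.1
    weight_decay: float = 0.1          # used with the AdamW optimizer
    epochs: int = 5
    warmup_steps: int = 300
    batch_size: int = 2
    learning_rate: float = 5e-4        # assumption: read as 5 x 10^-4
    mistral_full_ft_lr: float = 5e-5   # assumption: read as 5 x 10^-5


# (train, test) sizes in the order quoted in the Dataset Splits row.
REPORTED_SPLITS = [(2936, 732), (360, 88), (4089, 1018), (45622, 22811)]


def split_ratio(train_size: int, test_size: int) -> float:
    """Return the train:test ratio as a single number (4.0 means 4:1)."""
    return train_size / test_size


setup = ReportedSetup()
ratios = [round(split_ratio(train, test), 2) for train, test in REPORTED_SPLITS]

print(setup)
print(ratios)  # [4.01, 4.09, 4.02, 2.0]
```

The ratios show that the three splits described as 4:1 are approximately, not exactly, 4:1, while the split described as 2:1 is exact.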