Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Authors: Guangchen (Eric) Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher Brinton, Robert Sim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present the results of our experiments in Table 1, which demonstrates the following key findings: CI-RL consistently improves Integrity and Complete metrics across all models. ... Dataset. We separate the dataset² generated in Section 3.2 into disjoint training, evaluation, and test subsets, containing 590, 66, and 73 examples, respectively. ... Metrics. Let $A_i$ denote the set of all required keywords, and $D_i$ denote the set of all restricted keywords for a test example $s_i$ for $i \in \{1, 2, \ldots, N\}$. ... 4.1 Results ... 4.2 Ablation Studies ... 5 Evaluation Privacy Lens
Researcher Affiliation	Collaboration	Guangchen Lan Purdue University EMAIL Huseyin A. Inan Microsoft EMAIL Sahar Abdelnabi Microsoft EMAIL Janardhan Kulkarni Microsoft EMAIL Lukas Wutschitz Microsoft EMAIL Reza Shokri Google, National Univ of Singapore EMAIL Christopher G. Brinton Purdue University EMAIL Robert Sim Microsoft EMAIL
Pseudocode	No	To reduce the computational overhead associated with RL, we employ the GRPO algorithm [31], which eliminates the need for a critic network. To optimize the LLM induced policy $\pi_\theta$, it suggests to maximize the following objective function in each update: $J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, \{a_i\}_{i=1}^{G} \sim \pi_{\text{old}}(\cdot \mid q)} \left[ \min\left( \frac{\pi_\theta(a_i \mid q)}{\pi_{\text{old}}(a_i \mid q)} A_i,\ \operatorname{clip}\left( \frac{\pi_\theta(a_i \mid q)}{\pi_{\text{old}}(a_i \mid q)},\, 1 - \epsilon,\, 1 + \epsilon \right) A_i \right) - \beta D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right]$ where $\pi_{\text{ref}}$ is the reference policy with the initial model parameters, $\pi_{\text{old}}$ is the old policy with the parameters before this update, $\mathcal{D}$ is the prompt data set, $G$ is the group (rollout) size, $\beta$ is a hyperparameter to control the weight of the Kullback-Leibler (KL) divergence, $\epsilon$ is a hyperparameter to control the clip ratio, and $\operatorname{clip}(\cdot)$ is a clip function following the setting in PPO [29]. The KL divergence is calculated by $D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) := \frac{\pi_{\text{ref}}(a_i \mid q)}{\pi_\theta(a_i \mid q)} - \log \frac{\pi_{\text{ref}}(a_i \mid q)}{\pi_\theta(a_i \mid q)} - 1$, which forms a positive, unbiased, and low variance estimation of the true KL divergence. With a query $q \sim \mathcal{D}$, we sample $G$ complete answers from $\pi_{\text{old}}(\cdot \mid q)$, and $a_i$ denotes the $i$-th complete answer with corresponding reward $r_i = R(q, a_i)$ from the reward model $R$. We denote the group of rewards $r = (r_1, \ldots, r_G)$. The advantage is estimated directly via $A_i = \frac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)}$, and no critic model is required.
Open Source Code	Yes	Our code is available at: https://github.com/EricGLan/CI-RL ... ⁴Code at GitHub: https://github.com/EricGLan/CI-RL
Open Datasets	Yes	We construct a synthetic dataset consisting of approximately 700 automatically generated examples that span diverse scenarios and CI norms. We demonstrate on this dataset and its disjoint test set that our approach significantly reduces inappropriate information sharing while maintaining high task performance across multiple model families and sizes. ... ²Synthetic dataset: https://huggingface.co/datasets/huseyinatahaninan/ContextualIntegritySyntheticDataset
Dataset Splits	Yes	We separate the dataset² generated in Section 3.2 into disjoint training, evaluation, and test subsets, containing 590, 66, and 73 examples, respectively.
Hardware Specification	Yes	All tasks are trained and evaluated on a platform of 8 nodes with 8 NVIDIA A100 GPUs on each node, and 80 GB of memory for each GPU.
Software Dependencies	No	We base our training method on the VERL framework [32], adapting it to our tasks⁴.
Experiment Setup	Yes	Sampling a batch of queries with batch size B = 32, we take G = 16 inference results on each query from the current policy model, and calculate the rule-based rewards according to Equation (2) on each result. ... We use a learning rate $1 \times 10^{-6}$ without warmup steps, and a KL divergence weight β = 0.001 by default. We set the clip ratio ϵ = 0.2. The entropy penalty is 0. The maximum response length is set to 2048, and the temperature in LLM sampling is set to 0.7 in the training process.
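
The quantities quoted in the Pseudocode row (group-normalized advantage, per-sample KL estimator, and clipped surrogate) can be illustrated with a minimal sketch. This is not the authors' code and not the VERL implementation. The function names are hypothetical, each answer's log-probability is treated as a single scalar, the standard deviation is assumed to be the population form, and the small `tiny` term guarding against division by zero is an added assumption. The defaults for the clip ratio and KL weight are the values quoted in the Experiment Setup row.

```python
import math
from statistics import fmean, pstdev

# Values quoted in the Experiment Setup row.
CLIP_EPS = 0.2    # clip ratio (epsilon)
KL_BETA = 0.001   # KL divergence weight (beta)


def group_advantages(rewards, tiny=1e-8):
    """A_i = (r_i - mean(r)) / std(r) over one group of G rollouts.

    Assumptions (not stated in the source): population standard
    deviation, plus `tiny` to avoid dividing by zero when all
    rewards in the group are equal.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + tiny) for r in rewards]


def kl_estimate(logp_theta, logp_ref):
    """Per-sample estimator: ratio - log(ratio) - 1, ratio = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_term(logp_theta, logp_old, logp_ref, advantage,
              eps=CLIP_EPS, beta=KL_BETA):
    """Clipped surrogate minus the weighted KL penalty for one answer a_i."""
    ratio = math.exp(logp_theta - logp_old)           # pi_theta / pi_old
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip(ratio, 1-eps, 1+eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate - beta * kl_estimate(logp_theta, logp_ref)
```

Because the advantage is computed from the group's own rewards, no learned value function is needed, which matches the "no critic model is required" remark in the quoted passage.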