Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Privacy Reasoning in Ambiguous Contexts

Authors: Ren Yi, Octavian Suciu, Adrià Gascón, Sarah Meiklejohn, Eugene Bagdasarian, Marco Gruteser

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We study the ability of language models to reason about appropriate information disclosure, a central aspect of the evolving field of agentic privacy... By designing Camber, a framework for context disambiguation, we show that model-generated decision rationales can reveal ambiguities and that systematically disambiguating context based on these rationales leads to significant accuracy improvements (up to 13.3% in precision and up to 22.3% in recall) as well as reductions in prompt sensitivity. Overall, our results indicate that approaches for context disambiguation are a promising way forward to enhance agentic privacy reasoning.
Researcher Affiliation	Collaboration	Ren Yi Google Research EMAIL / Octavian Suciu Google Research EMAIL / Adrià Gascón Google Research EMAIL / Sarah Meiklejohn Google EMAIL / Eugene Bagdasarian Google Research EMAIL / Marco Gruteser Google Research EMAIL
Pseudocode	No	The paper describes methods and procedures in narrative text and through example prompts, but does not present any formal pseudocode or algorithm blocks.
Open Source Code	No	Our dataset is available at https://github.com/google-parfait/contextual-privacy-and-security. The provided link is for a dataset, not for the open-source code of the methodology described in the paper.
Open Datasets	Yes	Our dataset is available at https://github.com/google-parfait/contextual-privacy-and-security. To better evaluate both the privacy and utility of LLMs in probing-style tasks, we use two existing privacy benchmarks: PrivacyLens [4] and ConfAIde [3].
Dataset Splits	No	The paper describes evaluation on existing datasets (PrivacyLens+ and ConfAIde+) and manual coding of subsets for analysis, but it does not specify explicit training/validation/test splits for the models under evaluation or for the data generated for the experiments.
Hardware Specification	No	The paper provides details on the number of LLM inference invocations and token counts, such as 'The results in this paper are generated with approximately 200k LLM inference invocations, using on the order of 100 input tokens and 10 output tokens per invocation.' However, it does not specify the underlying hardware (e.g., specific GPU/CPU models or cloud infrastructure) used for these computations.
Software Dependencies	Yes	We use three state-of-the-art models: two reasoning models (Gemini 2.5 Pro, version Gemini 2.5 Pro Experimental, and Claude 3.7 Sonnet, version claude-3-7-sonnet-20250219), and one flagship non-reasoning model (GPT-4.1, version gpt-4.1-2025-04-14).
Experiment Setup	Yes	For all experiments, we set temperature to 0 and the maximum output length to one token (the yes/no judgment). Following existing benchmarks [3], we start with a neutral prompt that asks is it okay to share the relevant data without providing more information or context. Next, we create a restrictive prompt variant by adding "Please answer the question considering privacy norms in the response" to the neutral prompt, akin to prior work [3]. Finally, we create a permissive prompt variant by adding "Please answer the question sharing as much as possible while respecting privacy." The complete prompts are listed in Appendix E.
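
The setup quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustration only, not the authors' code: the names `DecodingConfig` and `build_prompt_variants` are hypothetical, the example question is made up, and appending the added sentence after the neutral prompt (as well as its exact punctuation) is an assumption, since the complete prompts appear only in the paper's Appendix E.

```python
from dataclasses import dataclass

# Sentences quoted in the Experiment Setup row. Their exact placement and
# punctuation inside the real prompts (Appendix E of the paper) are assumptions.
RESTRICTIVE_ADDITION = "Please answer the question considering privacy norms in the response."
PERMISSIVE_ADDITION = "Please answer the question sharing as much as possible while respecting privacy."


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical container for the decoding settings reported in the row."""
    temperature: float = 0.0      # "we set temperature to 0"
    max_output_tokens: int = 1    # one token: the yes/no judgment


def build_prompt_variants(neutral_prompt: str) -> dict[str, str]:
    """Return the neutral, restrictive, and permissive variants of one prompt."""
    base = neutral_prompt.strip()
    return {
        "neutral": base,
        "restrictive": f"{base} {RESTRICTIVE_ADDITION}",
        "permissive": f"{base} {PERMISSIVE_ADDITION}",
    }


if __name__ == "__main__":
    # Made-up question, used only to show the shape of the output.
    variants = build_prompt_variants("Is it okay to share the user's address with the courier?")
    config = DecodingConfig()
    for name, prompt in variants.items():
        print(f"[{name}] temperature={config.temperature} "
              f"max_tokens={config.max_output_tokens}\n{prompt}\n")
```

Keeping the three variants in one mapping makes it straightforward to send the same question under each framing and compare the one-token answers, which is how prompt sensitivity would be observed under this reading of the setup.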