Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Probabilistic Reasoning with LLMs for Privacy Risk Estimation

Authors: Jonathan Zheng, Alan Ritter, Sauvik Das, Wei "Coco" Xu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high variance predictions are 37.47% less accurate on average. We empirically evaluate BRANCH and state-of-the-art baselines in privacy risk estimation using a new dataset of user posts with human-annotated gold standard values.
Researcher Affiliation	Academia	(1) College of Computing, Georgia Institute of Technology; (2) Human-Computer Interaction Institute, Carnegie Mellon University
Pseudocode	No	The paper includes illustrations of the BRANCH framework (Figure 2) and detailed prompts in Appendix Q, but it does not contain a dedicated section or figure explicitly labeled 'Pseudocode' or 'Algorithm' presenting the methodology in a structured, code-like format.
Open Source Code	No	To prevent misuse by the aforementioned threat models, we will not release our dataset of real user posts and LLM conversations and finetuned models to the public. We will instead only share them upon request with researchers who agree to adhere to the following ethical guidelines:
Open Datasets	No	To prevent misuse by the aforementioned threat models, we will not release our dataset of real user posts and LLM conversations and finetuned models to the public. We will instead only share them upon request with researchers who agree to adhere to the following ethical guidelines:
Dataset Splits	Yes	We create a train/test/validation split of 100/66/14 Reddit posts, using all 50 synthetic documents in the train set and evaluating only human written posts, including all 40 user-LLM conversations as test documents to evaluate cross-domain generalization.
Hardware Specification	Yes	For finetuning BRANCH, we train Llama-3.1-Instruct-8B with LoRA on 4 A40 GPUs for 10 epochs, with a total batch size of 16, which takes at most 6 hours per run.
Software Dependencies	No	The paper lists various LLM models used (e.g., GPT-4o, LLaMA-3.1-Instruct) with their implicit versions or dates, and also mentions LoRA and AdamW. However, it does not specify explicit version numbers for broader software components like Python, PyTorch, or CUDA.
Experiment Setup	Yes	For finetuning BRANCH, we train Llama-3.1-Instruct-8B with LoRA on 4 A40 GPUs for 10 epochs, with a total batch size of 16, which takes at most 6 hours per run. We use a learning rate of 3e-4 and AdamW [48] as the optimizer with a weight decay of 0.01. We use a cosine learning rate schedule with a warmup ratio of 0.03, and for LoRA hyperparameters, we set rank=8, alpha=32, and dropout=0.05.
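
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object, which may help anyone attempting to reproduce the finetuning run. The sketch below is not the authors' code. The class name `FinetuneConfig`, the helper `lr_at_step`, and the `total_steps` argument are hypothetical. The per-GPU batch size assumes no gradient accumulation, and the schedule shape (linear warmup, then cosine decay to zero) is an assumption, since the paper excerpt only states "cosine learning rate schedule with a warmup ratio of 0.03".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    base_model: str = "Llama-3.1-Instruct-8B"
    num_gpus: int = 4
    epochs: int = 10
    total_batch_size: int = 16
    learning_rate: float = 3e-4
    weight_decay: float = 0.01   # used with AdamW
    warmup_ratio: float = 0.03
    lora_rank: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.05

    @property
    def per_gpu_batch_size(self) -> int:
        # Assumption: no gradient accumulation, so the total batch
        # is split evenly across the GPUs.
        return self.total_batch_size // self.num_gpus

    @property
    def lora_scaling(self) -> float:
        # LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


def lr_at_step(cfg: FinetuneConfig, step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup over warmup_ratio * total_steps steps,
    then cosine decay from the peak learning rate down to zero.
    """
    warmup_steps = max(1, math.ceil(cfg.warmup_ratio * total_steps))
    if step < warmup_steps:
        return cfg.learning_rate * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cfg.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The total number of optimizer steps is not reported in the excerpt, so `total_steps` is left as a caller-supplied value rather than derived from the dataset split.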