Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Empowering Users in Digital Privacy Management through Interactive LLM-Based Agents

Authors: Bolun Sun, Yifan Zhou, Haiyun Jiang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that LLMs significantly outperform traditional models in tasks like Data Practice Identification, Choice Identification, Policy Summarization, and Privacy Question Answering, setting new benchmarks in privacy policy analysis. Building on these findings, we introduce an innovative LLM-based agent that functions as an expert system for processing website privacy policies, guiding users through complex legal language without requiring them to pose specific questions. A user study with 100 participants showed that users assisted by the agent had higher comprehension levels (mean score of 2.6 out of 3 vs. 1.8 in the control group), reduced cognitive load (task difficulty ratings of 3.2 out of 10 vs. 7.8), increased confidence in managing privacy, and completed tasks in less time (5.5 minutes vs. 15.8 minutes).
Researcher Affiliation | Academia | Bolun Sun, SNF Agora Institute, Johns Hopkins University, Baltimore, MD 21218, USA, EMAIL; Yifan Zhou, Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602, USA, EMAIL; Haiyun Jiang*, School of Computer Science, Fudan University, Shanghai, China, EMAIL
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or presented in the paper. The methodology is described in narrative text and through a workflow diagram.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a link to a code repository for the methodology described.
Open Datasets | Yes | These experiments were conducted on the entire OPP-115 dataset (Wilson et al., 2016), a comprehensive collection of annotated privacy policies. [...] In our experiments on the Privacy Question Answering task, we tested the performance of GPT-3.5 and GPT-4o-mini on the Policy QA test dataset, designed to assess their ability to answer questions within the context of privacy policies. [...] In this experiment, following the methodology of Keymanesh et al. (2020), we used their dataset to evaluate GPT-4o and GPT-4o-mini on ten publicly available user agreements from platforms like Google, Amazon, and CNN.
Dataset Splits | No | The paper mentions using datasets like 'the entire OPP-115 dataset' and the 'Policy QA test dataset', but it does not specify explicit training/validation/test splits (e.g., percentages or exact counts) for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the LangChain framework and the OpenAI API with models ranging from GPT-4o to GPT-3.5, but does not specify exact version numbers for these software dependencies.
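Since the paper pins no versions, a reproducer must record their own environment and treat it as an assumption. A minimal Python sketch (the PyPI distribution names `langchain` and `openai` are the usual ones for these tools; the authors' actual versions are unknown):

```python
# Record the installed versions of the dependencies the paper names but
# does not pin. These are the reproducer's versions, not the authors'.
from importlib.metadata import PackageNotFoundError, version

def report_versions(packages):
    """Return 'pkg==x.y.z' lines, noting any package that is missing."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            lines.append(f"{pkg}: not installed")
    return lines

if __name__ == "__main__":
    print("\n".join(report_versions(["langchain", "openai"])))
```

Writing this output into a lockfile alongside any replication attempt at least makes the substituted versions explicit.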
Experiment Setup | Yes | Given that the task required the models to perform classification, we set the temperature parameter to zero to ensure deterministic outputs and eliminate randomness in predictions. [...] The GPT-4o-mini model, utilizing a top-10 selection strategy, outperformed the BERT-base model in answering questions within the privacy policy context. [...] These agreements were processed to extract the riskiest sentences, focusing on privacy and data handling at content ratios of 1/16 and 1/64.
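The temperature-zero classification setup the paper reports can be sketched as a request builder. The model name, system prompt, and category list (a real but abbreviated subset of the OPP-115 annotation scheme) are illustrative assumptions, not the authors' exact configuration, which is not published:

```python
# Build a chat-completion request for classifying a privacy-policy segment
# with temperature=0, as the paper does to make predictions deterministic.
OPP115_CATEGORIES = [
    "First Party Collection/Use",
    "Third Party Sharing/Collection",
    "User Choice/Control",
    "Data Retention",
]

def build_classification_request(segment: str, model: str = "gpt-4o-mini") -> dict:
    prompt = (
        "Classify the following privacy-policy segment into exactly one of "
        "these categories: " + "; ".join(OPP115_CATEGORIES) + ". "
        "Answer with the category name only."
    )
    return {
        "model": model,
        "temperature": 0,  # deterministic outputs, per the paper's setup
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": segment},
        ],
    }
```

With the `openai` package, the dict can be passed as `client.chat.completions.create(**build_classification_request(text))`. Note that temperature 0 reduces but does not fully guarantee determinism across API calls, which matters when interpreting a reproducibility claim built on it.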