Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy

Authors: Vishnu Vinod, Krishna Pillutla, Abhradeep Guha Thakurta

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations demonstrate a consistent 8× (or more) reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels.
Researcher Affiliation	Collaboration	Vishnu Vinod (1), Krishna Pillutla (1, 2), Abhradeep Thakurta (3). Affiliations: (1) CeRAI, IIT Madras; (2) WSAI, IIT Madras; (3) Google DeepMind
Pseudocode	Yes	Algorithm 1: INVISIBLEINK for DP Text Generation. Require: LLM logit API ϕ(· | ·), vocabulary V, query q, sensitive references R = {r_1, ..., r_B}, max text length T, clip norm C, temperature τ, top-k parameter, initial generation x =
Open Source Code	Yes	We open-source a pip-installable Python package (invink) for InvisibleInk at https://github.com/cerai-iitm/invisibleink. We also open source a pip-installable Python package (invink), available at https://github.com/cerai-iitm/invisibleink, to generate differentially private synthetic text using INVISIBLEINK. See G for an example of the invink package in action. We release our code at: https://github.com/cerai-iitm/InvisibleInk-Experiments.
Open Datasets	Yes	We experiment with three datasets in clinical, legal, and commercial domains, see Tab. 1. MIMIC-IV-Note [78, 79] is a de-identified collection of medical text associated with ICU patients; The Text Anonymization Benchmark (TAB) [80] contains N = 1013 (training set) court case notes... The Yelp Reviews dataset [81] contains user-generated reviews and ratings for businesses...
Dataset Splits	No	To generate multiple synthetic text samples, we partition the entirety of the sensitive reference dataset into B-sized chunks. Since the dataset partitions are defined in a completely data-independent manner, we may compose the DP guarantee in parallel across batches; see Theorem 11. The theoretical privacy guarantee for the generation of multiple text samples is thus the same as that of 1 text sample, i.e., the overall privacy guarantees are independent of the number of synthetic samples generated and only depend on the maximum number of BPE encoded tokens in each input.
Hardware Specification	Yes	Hardware. All experiments requiring a GPU (LLM inferences, PPL calculation, MedNER count calculation etc.) were conducted on one of two machines: (a) with 4 NVIDIA RTX L40S GPUs (with 48GB memory each), or (b) with 4 NVIDIA H100 GPUs (with 80GB memory each). Each task used only one GPU at a time. All wall-clock time measurements, if any, are performed using the L40S GPUs. The non-GPU jobs, such as computing MAUVE, were run on a machine with 240 AMD EPYC 9554 CPUs (clock speed: 3.10 GHz) with 64 virtual cores each and total memory of 845GB.
Software Dependencies	Yes	Software. We used Python 3.13.2, PyTorch 2.6 and Hugging Face Transformers 4.50.1.
Experiment Setup	Yes	D.5 Hyperparameter Selection: We discuss the choices of hyperparameters, the range explored, and the settings for reported results across all methods, INVISIBLEINK and other baselines, in this section. We also note that to allow comparisons between methods, we convert all DP guarantees into (ε, δ) guarantees. INVISIBLEINK. INVISIBLEINK, described in Algorithm 1, has the following hyperparameters: the clipping norm C, the batch size B, the sampling temperature τ and the top-k sampling parameter.
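
The Dataset Splits row quotes a data-independent partition of the sensitive references into B-sized chunks. A minimal sketch of that step is given below. It is not taken from the invink package: the function name partition_into_batches and the drop_last flag are hypothetical, and whether the authors discard a short final chunk is not stated in the quoted text, so that behavior is an assumption here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_into_batches(
    references: Sequence[T], batch_size: int, drop_last: bool = True
) -> List[List[T]]:
    """Split references into consecutive chunks of batch_size.

    Chunk boundaries depend only on record positions, never on record
    content, which is what makes the partition data-independent.
    drop_last is an assumption of this sketch, not a documented choice.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = [
        list(references[start:start + batch_size])
        for start in range(0, len(references), batch_size)
    ]
    # Optionally discard a final chunk that has fewer than batch_size records.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


if __name__ == "__main__":
    records = [f"record_{i}" for i in range(10)]
    for chunk in partition_into_batches(records, batch_size=4):
        print(chunk)
```

Because every record falls into at most one chunk, each synthetic sample is generated from a disjoint set of references, which is the setting in which the quoted text applies parallel composition.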