Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Secure and Confidential Certificates of Online Fairness

Authors: Olive Franzese, Ali Shahin Shamsabadi, Carter Luck, Hamed Haddadi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Empirical Evaluation. Objectives. We empirically evaluate the efficiency, scalability, and correctness of OATH for providing a repeatable fairness audit of ML-based services while protecting the confidentiality of evaluation data and model parameters. Datasets. We consider five common datasets for fairness benchmarking (described in Appendix H): COMPAS [2], Crime [36], Default Credit [51], Adult [3] and German Credit [15]. Implementation. We implement OATH in C++ using EMP-toolkit [46] (under an MIT License). All experiments were conducted by locally simulating the parties on a MacBook Pro laptop CPU. Models. OATH uses a zero-knowledge proof of correct inference Finf as a subroutine in order to accommodate arbitrary binary classifiers. We evaluate OATH using three different settings for Finf: (i) logistic regression (LR) implemented in EMP-toolkit, (ii) a small feed-forward neural network (FFNN) with ReLU activations suitable for tabular data implemented in EMP-toolkit, (iii) larger neural networks suitable for image data using Mystique [48] with three different networks LeNet-5 (62K parameters), ResNet-50 (23.5 mil parameters), and ResNet-101 (42.5 mil parameters). Baselines. We compare the efficiency of OATH against a baseline method for confidential online fairness certificates of group fairness.
Researcher Affiliation	Collaboration	Olive Franzese (University of Toronto & Vector Institute, Toronto, ON); Ali Shahin Shamsabadi (Brave Software, London, UK); Carter Luck (University of Massachusetts, Amherst, Amherst, MA); Hamed Haddadi (Imperial College London & Brave Software, London, UK)
Pseudocode	Yes	Algorithm 1: OATH Committed Query Answering. Input: public: security parameter λ; C: query point q; P: model M; V: no inputs. Output: C: model decision o, randomness r; P: query point q, randomness r; V: commitment string C = H(q || o || r). ... Algorithm 2: OATH Zero-Knowledge Fairness Audit. Input: public: the number of client queries n, fairness gap threshold θ, soundness parameter ν; P: model M, online data Q = {(q_i, α^i_s, o_i, r_i)}_{i=1}^n; V: commitments {C_i}_{i=1}^n, sensitive attribute check strings {(α^i_0, α^i_1)}_{i=1}^n. Output: V obtains b_pass ∈ {0, 1} indicating whether M satisfies demographic parity with respect to Q. ... Algorithm 3: OATH Zero-Knowledge Equalized Odds Audit. ... Algorithm 4: OATH Equalized Odds Calculation. ... Algorithm 5: Blame Attribution ... Algorithm 6: Group-Balanced Uniform Sample ... Algorithm 7: Fairness Audit w/o S Reveal.
Open Source Code	Yes	Our code is publicly available at https://github.com/cleverhans-lab/oath-zk-online-fairness.git.
Open Datasets	Yes	Datasets. We consider five common datasets for fairness benchmarking (described in Appendix H): COMPAS [2], Crime [36], Default Credit [51], Adult [3] and German Credit [15].
Dataset Splits	No	Our experiments are runtime benchmarks, and they run identically regardless of training and test details. (NeurIPS Paper Checklist - Question 6 Justification). The paper states: "We assume 10^6 total client queries, 7600 of which are randomly selected for consistency and correctness checks." However, this refers to the sampling strategy for the auditing process (for `ν` queries as described in Section 5.2), not the traditional training/validation/test splits for the ML models themselves on the datasets.
Hardware Specification	Yes	All experiments were conducted by locally simulating the parties on a MacBook Pro laptop CPU.
Software Dependencies	No	We implement OATH in C++ using EMP-toolkit [46] (under an MIT License).
Experiment Setup	Yes	The second and third columns of Table 1 report Audit Phase runtime when OATH is verifying either a LR or FFNN model respectively. This is the time required to audit the fairness of authenticated client query answers with our ZKP protocols. We assume 10^6 total client queries, 7600 of which are randomly selected for consistency and correctness checks. This size of the random sample was identified empirically as a good tradeoff between efficient runtime and strong tamper protection (see Section 5.2 for details). ... For ν = 3800, the same ϵ evades with probability at most 5.34 × 10^-9. We select this value as a good tradeoff between efficiency and reliability. See Figure 3 for possible ϵ deviations from an example group fairness threshold θ at varying numbers of verified queries.
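
Illustrative sketch (not from the paper): the Pseudocode row quotes a commitment string formed by hashing the query point, the model decision, and randomness (Algorithm 1), and a pass/fail bit indicating whether the demographic parity gap stays within the threshold θ (Algorithm 2). The Python below shows what those two quantities could look like when computed in the clear. It makes several assumptions: SHA-256 stands in for H, the byte encoding of the inputs is chosen here for convenience, all function names are hypothetical, and no zero-knowledge proof is involved. The authors' implementation is in C++ on EMP-toolkit and will differ.

```python
import hashlib
import secrets


def commit(q: bytes, o: int, r: bytes) -> str:
    """Hash commitment over (q, o, r); SHA-256 is an assumed stand-in for H."""
    # Length-prefix q so the concatenation is unambiguous (encoding chosen here).
    payload = len(q).to_bytes(4, "big") + q + bytes([o]) + r
    return hashlib.sha256(payload).hexdigest()


def demographic_parity_gap(decisions, groups):
    """Absolute difference in positive-decision rates between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        outcomes = [o for o, s in zip(decisions, groups) if s == g]
        if not outcomes:
            raise ValueError(f"no queries from group {g}")
        rates.append(sum(outcomes) / len(outcomes))
    return abs(rates[0] - rates[1])


def audit_passes(decisions, groups, theta):
    """Plaintext analogue of the pass bit: 1 if the gap is within theta, else 0."""
    return int(demographic_parity_gap(decisions, groups) <= theta)


if __name__ == "__main__":
    r = secrets.token_bytes(16)          # fresh randomness for one query
    c = commit(b"example-query", 1, r)   # hypothetical query encoding
    print("commitment:", c)

    decisions = [1, 0, 1, 1, 0, 1, 0, 0]  # binary model decisions
    groups = [0, 0, 0, 0, 1, 1, 1, 1]     # binary sensitive attribute
    print("gap:", demographic_parity_gap(decisions, groups))
    print("passes at theta=0.1:", audit_passes(decisions, groups, 0.1))
```

In the protocol described by the paper, the verifier sees only the commitments and learns only the pass bit; this sketch exposes every value and is meant solely to clarify what is being committed to and what is being checked.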