Robust Data Valuation with Weighted Banzhaf Values
Authors: Weida Li, Yaoliang Yu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that the Banzhaf value is not always the most robust when compared with a broader family: weighted Banzhaf values. To analyze this scenario, we introduce the concept of Kronecker noise to parameterize stochasticity, through which we prove that the uniquely robust semi-value, which can be analytically derived from the underlying Kronecker noise, lies in the family of weighted Banzhaf values while minimizing the worst-case entropy. In addition, we adopt the maximum sample reuse principle to design an estimator to efficiently approximate weighted Banzhaf values, and show that it enjoys the best time complexity in terms of achieving an (ϵ, δ)-approximation. Our theory is verified under both synthetic and authentic noises. For the latter, we fit a Kronecker noise to the inherent stochasticity, which is then plugged in to generate the predicted most robust semi-value. Our study suggests that weighted Banzhaf values are promising when facing undue noises in data valuation. |
| Researcher Affiliation | Academia | Weida Li vidaslee@gmail.com Yaoliang Yu School of Computer Science University of Waterloo Vector Institute yaoliang.yu@uwaterloo.ca |
| Pseudocode | No | The paper describes the estimation process using mathematical formulas (e.g., Eq. 4) and prose, but it does not include a clearly labeled pseudocode block or algorithm section. |
| Open Source Code | Yes | Our code is available at https://github.com/watml/weighted-Banzhaf. |
| Open Datasets | Yes | All datasets used are from open sources, and are classification tasks. Except for MNIST and FMNIST, each Dtr or Dval is balanced between different classes. Unless explicitly stated otherwise, we set |Dval| = 200. All utility functions are set to be the accuracy reported on Dval with logistic regression models being trained on Dtr, except that we implement LeNet (LeCun et al., 1998) for MNIST and FMNIST. (...) The datasets we use in the main paper are summarized in Table 4. |
| Dataset Splits | Yes | Let Dtr and Dval be training and validation datasets, respectively. We write n = |Dtr| (...) Unless explicitly stated otherwise, we set |Dval| = 200. All utility functions are set to be the accuracy reported on Dval with logistic regression models being trained on Dtr, except that we implement LeNet (LeCun et al., 1998) for MNIST and FMNIST. (...) We fix |Dtr| = 1,000 for all datasets except that it is |Dtr| = 2,000 for MNIST and FMNIST. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or cloud computing instances used for running the experiments. |
| Software Dependencies | No | The paper mentions implementing 'LeNet (LeCun et al., 1998) for MNIST and FMNIST' and 'logistic regression models'. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | All utility functions are set to be the accuracy reported on Dval with logistic regression models being trained on Dtr, except that we implement LeNet (LeCun et al., 1998) for MNIST and FMNIST. To have the merit of efficiency, we adopt one-epoch one-mini-batch learning for training models in all types of experiments (Ghorbani and Zou, 2019). (...) Besides, the learning rate is set to be 1.0. (...) The learning rate is set to be 0.05. (...) The total number of utility evaluations is set to be 400,000. (...) For each dataset, we randomly flip the labels of 20 percent of the data in Dtr to any of the remaining classes in a uniform manner. |
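The paper provides no pseudocode for its estimator, describing it instead via Eq. 4 and prose. As a hedged illustration only, the sketch below shows the general maximum-sample-reuse (MSR) principle for approximating weighted Banzhaf values: every sampled subset S (each player included independently with probability w) is reused for all players at once, contributing U(S) to player i's "with i" average if i ∈ S and to its "without i" average otherwise. The function name `msr_weighted_banzhaf`, the `utility` callable, and the parameter names are our own, not the paper's, and this is not claimed to reproduce the paper's exact Eq. 4.

```python
import numpy as np

def msr_weighted_banzhaf(utility, n, w=0.5, num_samples=10_000, seed=0):
    """Monte Carlo MSR-style estimator for weighted Banzhaf values.

    utility: callable mapping a boolean inclusion mask of shape (n,) to a scalar.
    w: per-player inclusion probability; w = 0.5 recovers the ordinary Banzhaf value.
    Each sampled subset is reused for every player i, so one utility evaluation
    updates n running averages at once (the "maximum sample reuse" idea).
    """
    rng = np.random.default_rng(seed)
    sums_in = np.zeros(n)
    counts_in = np.zeros(n)
    sums_out = np.zeros(n)
    counts_out = np.zeros(n)
    for _ in range(num_samples):
        mask = rng.random(n) < w  # include each player independently w.p. w
        u = utility(mask)
        sums_in[mask] += u
        counts_in[mask] += 1
        sums_out[~mask] += u
        counts_out[~mask] += 1
    # Guard against players that were never sampled in (or out of) a subset.
    mean_in = np.divide(sums_in, counts_in, out=np.zeros(n), where=counts_in > 0)
    mean_out = np.divide(sums_out, counts_out, out=np.zeros(n), where=counts_out > 0)
    # Weighted Banzhaf estimate: E[U(S) | i in S] - E[U(S) | i not in S].
    return mean_in - mean_out
```

For a quick sanity check, an additive utility U(S) = Σ_{i∈S} v_i has constant marginal contributions, so every semi-value (including any weighted Banzhaf value) equals v_i exactly, and the estimate should converge to v as the sample count grows.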