Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Model Shapley: Equitable Model Valuation with Black-box Access
Authors: Xinyi Xu, Thanh Lam, Chuan Sheng Foo, Bryan Kian Hsiang Low
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive empirical validation on the effectiveness of model Shapley using various real-world datasets and heterogeneous model types. Our implementation trains a GPR (as the model appraiser) on the MSVs of a subset of N = 150 models and examines its predictive performance on the remaining ones. |
| Researcher Affiliation | Academia | Dept. of Computer Science, National University of Singapore, Republic of Singapore; Inst. for Infocomm Research; Centre for Frontier AI Research, A*STAR, Republic of Singapore |
| Pseudocode | No | The paper does not contain any pseudocode blocks or clearly labeled algorithm sections. |
| Open Source Code | Yes | Our implementation is available at https://github.com/XinyiYS/ModelShapley. |
| Open Datasets | Yes | We train N = 150 independent models on MNIST (CIFAR-10)...We investigate 5 real-world datasets... including MNIST, CIFAR-10 [41], two medical datasets: a drug reviews dataset... (DrugRe) [23] and a medical imaging dataset... (MedNIST) [58], and a cyber-threat detection dataset... (KDD99) [28]. We perform additional experiments on CovType [5], MNIST and CIFAR-100. |
| Dataset Splits | Yes | We train a GPR (as the model appraiser) on a random subset of 150 model-MSV pairs to learn to predict the MSV on the remaining pairs. We examine the test performance using two error metrics: mean-squared error (MSE) and maximum error (MaxE) w.r.t. varied training ratios from 5% to 50%, in Fig. 2. In particular, results for training ratio of 20% are in Table 2. |
| Hardware Specification | Yes | We perform our experiments on a server with Intel(R) Xeon(R) Gold 6226R CPU @2.90GHz and four NVIDIA GeForce RTX 3080s. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments. It mentions an 'automatic differentiation package' in a footnote but without a version. |
| Experiment Setup | No | The paper describes aspects of the experimental setup, such as model types used (e.g., LR, MLP, CNN, ResNet-18, SqueezeNet, DenseNet-121), the kernel for GPR (squared exponential), and data manipulation (e.g., multiplying probability by a factor). However, it lacks crucial hyperparameters for the training of the underlying N=150 models (e.g., learning rates, batch sizes, optimizers, specific epoch counts), which are essential for full reproducibility of the models themselves. |
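
The Dataset Splits row describes an evaluation protocol: fit a GPR on a random fraction of the model-MSV pairs, predict the MSV of the held-out pairs, and report MSE and maximum error across training ratios. The following is a minimal sketch of that protocol under stated assumptions, not the authors' implementation. The 2-D features and target values are synthetic stand-ins for model representations and MSVs, the GP is a small hand-written one with a squared exponential kernel at a fixed length scale and a small noise term, and all function names (`sq_exp_kernel`, `gpr_predict`, `evaluate`) are hypothetical.

```python
import math
import random

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length_scale ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def cho_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gpr_predict(X_train, y_train, X_test, noise=1e-4):
    """GP posterior mean, using the training mean as a constant prior mean."""
    mean = sum(y_train) / len(y_train)
    K = [[sq_exp_kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X_train)] for i, a in enumerate(X_train)]
    alpha = cho_solve(cholesky(K), [t - mean for t in y_train])
    return [mean + sum(sq_exp_kernel(x, a) * w for a, w in zip(X_train, alpha))
            for x in X_test]

def evaluate(X, y, train_ratio, seed=0):
    """Random split, then MSE and maximum absolute error on the held-out pairs."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_train = max(1, int(train_ratio * len(X)))
    tr, te = idx[:n_train], idx[n_train:]
    pred = gpr_predict([X[i] for i in tr], [y[i] for i in tr], [X[i] for i in te])
    errs = [abs(p - y[i]) for p, i in zip(pred, te)]
    return sum(e * e for e in errs) / len(errs), max(errs)

# Synthetic stand-in data (assumption): 150 "models" with 2-D features
# and a smooth target playing the role of the MSV.
rng = random.Random(42)
X = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(150)]
y = [math.sin(a) + 0.5 * math.cos(b) for a, b in X]

for ratio in (0.05, 0.2, 0.5):
    mse, max_e = evaluate(X, y, ratio)
    print(f"train ratio {ratio:.0%}: MSE={mse:.4f}, MaxE={max_e:.4f}")
```

In practice a library GPR with tuned kernel hyperparameters would likely replace the hand-written solver; the sketch only illustrates the split-and-score loop over training ratios.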