Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Faithful Group Shapley Value

Authors: Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical experiments demonstrate that our algorithm significantly outperforms state-of-the-art methods in computational efficiency and approximation accuracy, while ensuring faithful group-level valuation.
Researcher Affiliation	Academia	Kiljae Lee (The Ohio State University), Ziqi Liu (Carnegie Mellon University), Weijing Tang (Carnegie Mellon University), Yuan Zhang (The Ohio State University)
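Pseudocode	Yes	Algorithm 1 Approximate FGSV($S_0$). Require: Dataset $D$, group $S_0$, threshold $\bar{s}$, subsample sizes $m_1$, $m_2$. 1: Initialize $n = |D|$, $s_0 = |S_0|$, and $\alpha_0 = s_0/n$. 2: for $s = 1$ to $n - 1$ do 3: if $s < \bar{s}$ then 4: Estimate $\hat{\mu}_{m_1}(s_1/s;\, s, s_0, n)$ for each $s_1 \in [\max\{0, s + s_0 - n\}, \min\{s, s_0\}]$ by Eq. (10). 5: Compute $\hat{T}(s)$ by Eq. (8), replacing $\mu$ by $\hat{\mu}_{m_1}$. 6: else 7: $s_1^{*} \leftarrow s\alpha_0$ (rounded to an integer). 8: Estimate $\widehat{\mu}_{m_2}(s_1^{*}/s;\, s, s_0, n)$ by Eq. (11). 9: $\hat{T}(s) \leftarrow \frac{n}{n-1}\,\alpha_0(1 - \alpha_0)\,\widehat{\mu}_{m_2}(s_1^{*}/s;\, s, s_0, n)$. 10: end if 11: end for 12: return $\frac{s_0}{n}\,[U([n]) - U(\emptyset)] + \sum_{s=1}^{n-1} \hat{T}(s)$.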
Open Source Code	Yes	The code and instructions to reproduce the experiments are provided in the supplementary material and available at https://github.com/KiljaeL/Faithful_GSV.
Open Datasets	Yes	Following [35], we fine-tune Stable Diffusion v1.4 [28] using Low-Rank Adaptation (LoRA; [13]) on four brand logos from Flickr Logo-27 [16]. ... We conduct our experiment on the Diabetes dataset [5], which contains 442 individuals, each described by 10 demographic and health-related features (e.g., sex, age, and BMI).
Dataset Splits	Yes	Experimental setup. Following [35], we fine-tune Stable Diffusion v1.4 [28] using Low-Rank Adaptation (LoRA; [13]) on four brand logos from Flickr Logo-27 [16]. The utility $U(\cdot\,; x^{(\mathrm{gen})})$ is the average log-likelihood of generating 20 brand-specific images $x^{(\mathrm{gen})}$ using the prompt "A logo by [brand name]" (see example images in Panel (a) of Figure 3). We compare SRS and FSRS under two grouping scenarios: (1) 30 images from each brand form a single group, and (2) the Google and Sprite datasets are each split into two subgroups (20/10 images), launching a shell company attack.
Hardware Specification	Yes	The paper provides detailed information about the compute environment for each experiment, including CPU/GPU specifications. Resource-demanding experiments, such as those involving generative AI models, include descriptions of GPU hardware and fine-tuning time.
Software Dependencies	No	The paper mentions using Stable Diffusion v1.4 [28] and Low-Rank Adaptation (LoRA; [13]), but does not specify software versions for programming languages, libraries, or other dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Given a fixed budget of 20,000 utility function evaluations, we record the absolute relative error of the FGSV estimate every 200 iterations for each method. ... The utility $U(\cdot\,; x^{(\mathrm{gen})})$ is the average log-likelihood of generating 20 brand-specific images $x^{(\mathrm{gen})}$ using the prompt "A logo by [brand name]". ... Our predictive model is ridge regression, and we measure utility as the negative mean squared error on a held-out test set, with the null utility set to the variance of the test responses. ... For each grouping scheme, we compute GSV exactly and estimate FGSV via 30 Monte Carlo replications.
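
The error-tracking protocol quoted under Experiment Setup (a fixed budget of 20,000 utility evaluations, with the absolute relative error recorded every 200 iterations) can be illustrated with a minimal sketch. This is not the authors' code: the function name track_relative_error, the sample_fn callable, and the Gaussian toy sampler are hypothetical stand-ins, and the sketch assumes one utility evaluation per iteration and a plain running-mean estimator in place of the paper's FGSV estimator.

```python
import random


def track_relative_error(sample_fn, true_value, budget=20_000, record_every=200):
    """Running-mean Monte Carlo estimate with periodic error snapshots.

    sample_fn  : hypothetical callable returning one noisy draw per evaluation
                 (assumption: one utility evaluation per iteration)
    true_value : reference value the estimate is compared against (nonzero)
    """
    if true_value == 0:
        raise ValueError("relative error is undefined for a zero reference value")
    total = 0.0
    errors = []  # (evaluations_used, absolute_relative_error) pairs
    for t in range(1, budget + 1):
        total += sample_fn()
        if t % record_every == 0:
            estimate = total / t
            errors.append((t, abs(estimate - true_value) / abs(true_value)))
    return errors


# Toy stand-in for an estimator: noisy draws centred on a known value of 2.0.
rng = random.Random(0)
curve = track_relative_error(lambda: rng.gauss(2.0, 1.0), true_value=2.0)
print(len(curve), curve[0][0], curve[-1][0])  # prints: 100 200 20000
```

Under these assumptions the budget yields 100 checkpoints per method, so error curves for different estimators can be compared at matching evaluation counts.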