Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Integral Probability Metrics Meet Neural Networks: The Radon-Kolmogorov-Smirnov Test
Authors: Seunghoon Paik, Michael Celentano, Alden Green, Ryan J. Tibshirani
JMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that the RKS test has asymptotically full power at distinguishing any distinct pair P ≠ Q of distributions, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test. We complement our theory with numerical experiments to explore the operating characteristics of the RKS test compared to other popular nonparametric two-sample tests. 4. Experiments |
| Researcher Affiliation | Academia | Seunghoon Paik (1), Michael Celentano (1), Alden Green (2), Ryan J. Tibshirani (1). (1) Department of Statistics, University of California, Berkeley, CA 94720, USA; (2) Department of Statistics, Stanford University, Stanford, CA 94305, USA |
| Pseudocode | Yes | For concreteness, we summarize our computational approach below in Algorithm 1. Algorithm 1 RKS test statistic |
| Open Source Code | Yes | Python code to replicate all of our experimental results is available at https://github.com/100shpaik/. |
| Open Datasets | No | For each dimension d, we consider five settings for P, Q, which are described in Table 1. In each setting, the parameter v controls the discrepancy between P and Q, but its precise meaning depends on the setting. The settings were broadly chosen in order to study the operating characteristics of the RKS test when differences between P and Q occur in one direction (settings 1–4), and in all directions (setting 5). Among the settings in which the differences occur in one direction, we also investigate different varieties (settings 1 and 2: mean shift under different geometries, setting 3: tail difference, setting 4: variance difference). Figure 2 visualizes samples drawn from each task in d = 2 dimensions, whereas Figure 3 exaggerates the deviation between P, Q (larger values of v) to better illustrate the geometry. Finally, we note that since the RKS test is rotationally invariant, the fact that the chosen differences in Table 1 are axis-aligned is just a matter of convenience, and the results would not change if these differences instead occurred along arbitrary directions in R^d. Table 1: Experimental settings. Here N_d(µ, Σ) means the d-dimensional normal distribution with mean µ and covariance Σ, and t(v) means the t distribution with v degrees of freedom. |
| Dataset Splits | No | We fix the sample sizes to m = n = 512 throughout, and study four choices of dimension: d = 2, 4, 8, 16. For each setting, we compute these test statistics under the null, where each xi and yi are sampled i.i.d. from the mixture m/(m+n) P + n/(m+n) Q, and under the alternative, where xi are i.i.d. from P and yi from Q. We then repeat this 100 times (draws of samples, and computation of test statistics), and trace out ROC curves (true positive versus false positive rates) as we vary the rejection threshold for each test. |
| Hardware Specification | No | For k ≥ 1, we apply the torch.optim.Adam optimizer (a variation on gradient descent), as implemented in PyTorch, to (10). For k = 0, such a first-order scheme is not applicable due to the fact that the gradient of the 0th degree ridge spline (w^T x − b)^0_+ = 1{w^T x ≥ b} (with respect to w and b) is almost everywhere zero. As a surrogate, we directly approximate the optimum (w*, b*) in (2) using logistic regression, where the class labels identify samples from P versus Q, as implemented in sklearn.linear_model.LogisticRegression in Python. |
| Software Dependencies | No | For k ≥ 1, we apply the torch.optim.Adam optimizer (a variation on gradient descent), as implemented in PyTorch, to (10). For k = 0, such a first-order scheme is not applicable due to the fact that the gradient of the 0th degree ridge spline (w^T x − b)^0_+ = 1{w^T x ≥ b} (with respect to w and b) is almost everywhere zero. As a surrogate, we directly approximate the optimum (w*, b*) in (2) using logistic regression, where the class labels identify samples from P versus Q, as implemented in sklearn.linear_model.LogisticRegression in Python. |
| Experiment Setup | Yes | For k ≥ 1, we apply the torch.optim.Adam optimizer (a variation on gradient descent), as implemented in PyTorch, to (10). We use a betas parameter (0.9, 0.99), learning rate 0.5, number of iterations T = 200, penalty parameter λ = 1, and number of neurons N = 10. To enforce the nonnegativity condition on b, we project b to [0, ∞) after each gradient step. Rather than take the last iterate, we choose the maximal IPM value among the iterates (after rescaling by the RTV^k seminorm of each iterate so that it lies in the unit seminorm ball). Further, we repeat this over three random initializations, and select the best resultant IPM value to be the final output. We fix the sample sizes to m = n = 512 throughout, and study four choices of dimension: d = 2, 4, 8, 16. For the RKS tests, we examine smoothness degrees k = 0, 1, 2, 3, and we center the input data to have sample mean zero jointly across both samples. |
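The ROC protocol quoted under Dataset Splits (100 repetitions under the null mixture and under the alternative, sweeping the rejection threshold) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the Gaussian draws are hypothetical stand-ins for actual test-statistic values.

```python
import numpy as np

# Illustrative sketch of the ROC protocol: statistics are computed under the
# null (samples drawn from the pooled mixture) and under the alternative
# (x_i ~ P, y_i ~ Q), over 100 repetitions; Gaussian placeholders stand in
# for actual test-statistic draws.
rng = np.random.default_rng(0)
reps = 100
null_stats = rng.normal(0.0, 1.0, size=reps)  # placeholder null draws
alt_stats = rng.normal(1.5, 1.0, size=reps)   # placeholder alternative draws

# Sweep the rejection threshold from high to low; record the false positive
# rate (rejections under the null) against the true positive rate.
thresholds = np.sort(np.concatenate([null_stats, alt_stats]))[::-1]
fpr = np.array([(null_stats >= t).mean() for t in thresholds])
tpr = np.array([(alt_stats >= t).mean() for t in thresholds])
```

Plotting `tpr` against `fpr` traces out one ROC curve per test, as described in the excerpt.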
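The k = 0 surrogate quoted under Hardware Specification and Software Dependencies (logistic regression in place of gradient descent) admits a minimal sketch, assuming a synthetic mean-shift pair for P, Q. The witness form f(z) = 1{w^T z ≥ b} follows the excerpt; the data and the mapping from `LogisticRegression` coefficients to (w*, b*) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(512, 2))  # sample from P (hypothetical)
y = rng.normal(0.5, 1.0, size=(512, 2))  # sample from Q, mean-shifted

# Class labels identify samples from P (0) versus Q (1).
data = np.vstack([x, y])
labels = np.r_[np.zeros(len(x)), np.ones(len(y))]

clf = LogisticRegression().fit(data, labels)
w = clf.coef_.ravel()          # approximate witness direction w*
b = -clf.intercept_.item()     # decision boundary w^T z - b = 0

# Degree-0 ridge-spline witness f(z) = 1{w^T z >= b}; the statistic is the
# difference in empirical means of f over the two samples.
fx = (x @ w >= b).astype(float)
fy = (y @ w >= b).astype(float)
stat = abs(fx.mean() - fy.mean())
```

The logistic-regression direction serves only as a proxy for the optimal halfspace; any monotone link yielding a separating hyperplane would play the same role here.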
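The k ≥ 1 procedure quoted under Experiment Setup can also be sketched with the stated hyperparameters (betas (0.9, 0.99), learning rate 0.5, T = 200, λ = 1, N = 10, projection of b onto [0, ∞), best rescaled iterate). The ridge-spline parameterization and the RTV^k seminorm proxy sum_j |a_j|·||w_j||^k are assumptions, not taken from the paper, and the data are hypothetical.

```python
import torch

torch.manual_seed(0)
m = n = 512
d, N, k, lam = 2, 10, 1, 1.0

x = torch.randn(m, d)        # sample from P (hypothetical)
y = torch.randn(n, d) + 0.5  # sample from Q, mean-shifted

# Ridge-spline network f(z) = sum_j a_j * (w_j^T z - b_j)_+^k
w = torch.randn(N, d, requires_grad=True)
b = torch.rand(N, requires_grad=True)
a = torch.randn(N, requires_grad=True)

def f(z):
    return (torch.clamp(z @ w.T - b, min=0.0) ** k) @ a

opt = torch.optim.Adam([w, b, a], lr=0.5, betas=(0.9, 0.99))
best = 0.0
for _ in range(200):  # T = 200 iterations
    ipm = f(x).mean() - f(y).mean()
    # Assumed RTV^k seminorm proxy for this parameterization.
    seminorm = (a.abs() * w.norm(dim=1) ** k).sum()
    loss = -ipm + lam * seminorm  # penalized objective, lambda = 1
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        b.clamp_(min=0.0)  # project b onto [0, inf) after each step
        # Track the best IPM value after rescaling each iterate into the
        # unit seminorm ball, rather than keeping the last iterate.
        sem = (a.abs() * w.norm(dim=1) ** k).sum().clamp(min=1e-12)
        best = max(best, ((f(x).mean() - f(y).mean()).abs() / sem).item())
```

Per the excerpt, this whole loop would be repeated over three random initializations, keeping the best resulting IPM value.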