Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
Authors: Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments illustrate the performance of kNNSampler. The code for kNNSampler is made publicly available. We report experimental results on synthetic data in Section 4 and on real solar-power data in Section 5. |
| Researcher Affiliation | Collaboration | Parastoo Pashmchi (SAP Labs France, E-Mobility Research; EURECOM, Sophia Antipolis, France); Jérôme Benoit (SAP Labs France, E-Mobility Research); Motonobu Kanagawa (EURECOM, Sophia Antipolis, France) |
| Pseudocode | Yes | Algorithm 1: kNNSampler. Input: number of nearest neighbors $k$; observed covariates $x_1, \dots, x_m \in \mathcal{X}$ with missing responses; observed covariate-response pairs $(x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$. Output: imputed responses $\hat{y}_{1,\mathrm{imp}}, \dots, \hat{y}_{m,\mathrm{imp}} \in \mathcal{Y}$. For $i = 1$ to $m$: $\hat{y}_{i,\mathrm{imp}} := y_j$, where $j \in \{1, \dots, n\}$ is uniformly sampled from $\mathrm{NN}(x_i, k, X_n)$ in equation 4, the indices of the $k$-nearest neighbors of $x_i$ in $X_n = \{x_1, \dots, x_n\}$. End. |
| Open Source Code | Yes | The code for kNNSampler is made publicly available (footnote 1: https://github.com/SAP/knn-sampler). |
| Open Datasets | Yes | We use a Kaggle dataset (footnote 8: https://www.kaggle.com/datasets/samuelkamau/solar-data/) that contains solar panel DC powers (responses) and the corresponding irradiations (covariates), totaling 67,698 covariate-response pairs. |
| Dataset Splits | Yes | The number k of nearest neighbors is a hyperparameter of kNNSampler. The theoretical and empirical results below indicate that k should not be fixed to a prespecified value (e.g., k = 5), and should be chosen depending on the available data. One way is to perform cross-validation for kNN regression on the data (x1, y1), . . . , (xn, yn) and select, among candidate values, the k that minimizes the mean-square error on held-out observed responses, averaged over different training-validation splits. In particular, the present work uses Leave-One-Out Cross-Validation (LOOCV) using the fast computation method recently proposed by Kanagawa (2024). |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) are mentioned in the paper for running experiments. |
| Software Dependencies | No | kNNImputer (Troyanskaya et al., 2001) is one of the most widely used imputation methods, owing to its simplicity and availability in popular software packages such as scikit-learn (Pedregosa et al., 2011). The paper mentions 'scikit-learn' but does not specify its version number or versions for other software dependencies. |
| Experiment Setup | Yes | We set the number k of nearest neighbours as k = 5, which is the default setting in scikit-learn and widely used in practice. We use the authors' recommended settings: inverse temperature τ = 50 and kernel bandwidth h = 0.03. The number k of nearest neighbours for kNNSampler is determined by the fast leave-one-out cross-validation method of Kanagawa (2024) using the observed covariate-response pairs. Specifically, we set n ∈ {2800, 4800, 6800, 8800, 10800}. |
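
The procedure quoted in the Pseudocode row, together with the cross-validated choice of k quoted in the Dataset Splits row, can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the authors' implementation (which is published at the repository linked above). The function names `knn_indices`, `knn_sampler`, and `select_k_loocv` are hypothetical, Euclidean distance is assumed, covariates are assumed to be stored as lists of tuples, and the leave-one-out loop is a plain brute-force version rather than the fast method of Kanagawa (2024).

```python
import math
import random


def knn_indices(x_query, xs, k):
    """Indices of the k nearest neighbors of x_query among xs (Euclidean distance assumed)."""
    order = sorted(range(len(xs)), key=lambda j: math.dist(x_query, xs[j]))
    return order[:k]


def knn_sampler(xs_missing, xs_obs, ys_obs, k, rng=None):
    """For each covariate with a missing response, draw the response of one
    observed point chosen uniformly at random among its k nearest neighbors."""
    rng = rng or random.Random()
    imputed = []
    for x in xs_missing:
        j = rng.choice(knn_indices(x, xs_obs, k))
        imputed.append(ys_obs[j])
    return imputed


def select_k_loocv(xs_obs, ys_obs, candidates):
    """Brute-force leave-one-out CV for kNN regression (not the fast method
    cited in the paper); returns the candidate k with the lowest mean squared error."""
    n = len(xs_obs)
    best_k, best_mse = None, math.inf
    for k in candidates:
        sq_err = 0.0
        for i in range(n):
            # Hold out point i and predict its response from the remaining points.
            rest_x = xs_obs[:i] + xs_obs[i + 1:]
            rest_y = ys_obs[:i] + ys_obs[i + 1:]
            nbrs = knn_indices(xs_obs[i], rest_x, k)
            pred = sum(rest_y[j] for j in nbrs) / len(nbrs)
            sq_err += (ys_obs[i] - pred) ** 2
        mse = sq_err / n
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k


# Toy data (made up for illustration): two well-separated clusters.
xs_obs = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
ys_obs = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
k = select_k_loocv(xs_obs, ys_obs, candidates=[1, 2, 3])
imputed = knn_sampler([(0.05,), (1.15,)], xs_obs, ys_obs, k, rng=random.Random(0))
```

Because each imputed value is an observed response rather than a neighborhood average, repeated draws reproduce the spread of responses near each query point, which is the property the paper's title refers to as recovering missing value distributions.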