Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Graph–Smoothed Bayesian Black-Box Shift Estimator and Its Information Geometry

Authors: Masanari Kimura

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 7 Experiments. 7.1 Experimental Protocol and Implementation. Datasets and synthetic label shifts. We evaluate on MNIST (K = 10) [14], CIFAR-10 (K = 10) and CIFAR-100 (K = 100) datasets [32]. For each dataset we treat the official training split as the source domain and the official test split as the pool from which an unlabelled target domain is drawn... Main Empirical Findings. Table 3 compares GS-B3SE with four widely used point estimators on three datasets.
Researcher Affiliation Academia Masanari Kimura, School of Mathematics and Statistics, The University of Melbourne
Pseudocode No The paper describes the joint Bayesian model and inference steps (HMC or block Newton CG) in Section 4 'Methodology' and Section 7 'Experiments' but does not include a dedicated pseudocode or algorithm block.
Open Source Code Yes The codes for numerical experiments are submitted as the supplemental material.
Open Datasets Yes We evaluate on MNIST (K = 10) [14], CIFAR-10 (K = 10) and CIFAR-100 (K = 100) datasets [32].
Dataset Splits Yes For each dataset we treat the official training split as the source domain and the official test split as the pool from which an unlabelled target domain is drawn. i) Source set: Sample 10,000 instances from the training partition according to p and train a backbone classifier (ResNet-18 [22, 48]) for 100 epochs with standard data-augmentation. ii) Validation set: Hold out 5,000 labelled source instances, stratified by p, to estimate the empirical confusion matrix C. iii) Target set: Draw n = 10,000 unlabelled instances from the test partition using probabilities q.
Hardware Specification Yes All routines implemented in PyMC and run on a single NVIDIA T4.
Software Dependencies No All routines implemented in PyMC and run on a single NVIDIA T4. Although "PyMC" is named, no version number is given for it or for any other software dependency, which a reproducible description requires.
Experiment Setup Yes Gamma hyper-priors: a_q = b_q = a_C = b_C = 1, giving vague Gamma(1, 1) on τ_q and τ_C. Four independent HMC chains, each with 500 warm-up (NUTS) and 1,000 posterior iterations; leap-frog step-size adaptively tuned. Block Newton CG inner optimizer: tolerance 10^{-4}, at most eight iterations per Newton step, stop when the relative change of the joint log-density falls below 10^{-3}.
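
The split protocol quoted in the Dataset Splits row (draw a source sample according to p, hold out labelled source data to estimate C, draw a target sample according to q) can be illustrated with a small sketch. This is not the paper's code. The function names, the toy label lists, the sample sizes, and the 80% accuracy of the stand-in classifier are all assumptions made for illustration, and the confusion matrix uses the joint-frequency convention (prediction i, true label j) that is common in black-box shift estimation; the paper's normalization may differ.

```python
import random
from collections import Counter


def sample_indices(labels, prior, n, rng):
    """Draw n distinct indices whose class counts follow a multinomial over `prior`."""
    pools = {}
    for idx, y in enumerate(labels):
        pools.setdefault(y, []).append(idx)
    classes = list(range(len(prior)))
    # Draw per-class counts first, then sample without replacement inside each class.
    counts = Counter(rng.choices(classes, weights=prior, k=n))
    chosen = []
    for k, c in counts.items():
        pool = pools.get(k, [])
        if c > len(pool):
            raise ValueError(f"class {k}: need {c} examples, only {len(pool)} available")
        chosen.extend(rng.sample(pool, c))
    rng.shuffle(chosen)
    return chosen


def joint_confusion(preds, labels, num_classes):
    """C[i][j] = fraction of validation points with prediction i and true label j."""
    n = len(labels)
    C = [[0.0] * num_classes for _ in range(num_classes)]
    for y_hat, y in zip(preds, labels):
        C[y_hat][y] += 1.0 / n
    return C


# Toy stand-ins for the official train/test label arrays (hypothetical data).
rng = random.Random(0)
K = 3
train_labels = [k for k in range(K) for _ in range(200)]
test_labels = [k for k in range(K) for _ in range(200)]
p = [1 / 3, 1 / 3, 1 / 3]   # source label distribution (assumed uniform here)
q = [0.6, 0.3, 0.1]         # shifted target label distribution (made-up values)

source_idx = sample_indices(train_labels, p, 150, rng)
target_idx = sample_indices(test_labels, q, 150, rng)

# Validation labels plus a stand-in classifier that is right about 80% of the time.
val_labels = [train_labels[i] for i in source_idx[:50]]
val_preds = [y if rng.random() < 0.8 else rng.randrange(K) for y in val_labels]
C = joint_confusion(val_preds, val_labels, K)
```

Sampling counts first and indices second keeps every drawn index distinct, which matches drawing a finite target set from a fixed test pool; the sketch raises an error rather than silently resampling when a class pool is too small for the requested shift.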
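
The optimizer settings in the Experiment Setup row (inner CG tolerance 10^{-4}, at most eight CG iterations per Newton step, stop when the relative change of the objective falls below 10^{-3}) can be illustrated on a toy problem. The sketch below is a generic Newton-CG loop with a backtracking line search, not the paper's block scheme or its joint model. The objective (a Poisson-style term plus a path-graph Laplacian penalty), the vector b, the value of tau, the line search, and all function names are assumptions chosen only to make the stopping rules concrete.

```python
import math


def laplacian_matvec(v):
    """Multiply by the Laplacian of a path graph (hypothetical toy graph)."""
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        if i > 0:
            out[i] += v[i] - v[i - 1]
        if i < n - 1:
            out[i] += v[i] - v[i + 1]
    return out


def neg_log_density(x, b, tau):
    """Toy objective: sum(exp(x_i) - b_i x_i) + 0.5 * tau * x^T L x."""
    Lx = laplacian_matvec(x)
    smooth = 0.5 * tau * sum(xi * li for xi, li in zip(x, Lx))
    return sum(math.exp(xi) - bi * xi for xi, bi in zip(x, b)) + smooth


def gradient(x, b, tau):
    Lx = laplacian_matvec(x)
    return [math.exp(xi) - bi + tau * li for xi, bi, li in zip(x, b, Lx)]


def hessian_matvec(x, v, tau):
    """Hessian is diag(exp(x)) + tau * L, which is positive definite."""
    Lv = laplacian_matvec(v)
    return [math.exp(xi) * vi + tau * li for xi, vi, li in zip(x, v, Lv)]


def conjugate_gradient(matvec, rhs, tol=1e-4, max_iter=8):
    """Approximately solve matvec(d) = rhs for a symmetric positive-definite operator."""
    d = [0.0] * len(rhs)
    r = list(rhs)
    p = list(r)
    rs = sum(ri * ri for ri in r)
    rhs_norm = math.sqrt(rs)
    for _ in range(max_iter):
        # Stop once the residual is small relative to the right-hand side.
        if math.sqrt(rs) <= tol * max(rhs_norm, 1e-300):
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return d


def newton_cg(b, tau, rel_tol=1e-3, max_newton=50):
    """Newton steps with truncated-CG directions; stops on small relative objective change."""
    x = [0.0] * len(b)
    f_old = neg_log_density(x, b, tau)
    for it in range(1, max_newton + 1):
        g = gradient(x, b, tau)
        d = conjugate_gradient(lambda v: hessian_matvec(x, v, tau), [-gi for gi in g])
        step = 1.0
        while True:  # backtracking: halve the step until the objective does not increase
            x_new = [xi + step * di for xi, di in zip(x, d)]
            f_new = neg_log_density(x_new, b, tau)
            if f_new <= f_old or step < 1e-8:
                break
            step *= 0.5
        rel_change = abs(f_new - f_old) / max(abs(f_old), 1e-12)
        x, f_old = x_new, f_new
        if rel_change < rel_tol:
            break
    return x, it


b = [5.0, 20.0, 8.0, 2.0, 12.0]  # made-up toy data
tau = 1.0                        # made-up smoothing strength
x_hat, n_iter = newton_cg(b, tau)
```

Capping the inner solve at eight iterations means each Newton direction is only approximate, but any truncated-CG iterate started from zero is still a descent direction for a positive-definite Hessian, so the line search can always make progress.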