Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Conditional Distribution Compression via the Kernel Conditional Mean Embedding
Authors: Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments Building on the experimental setup of Kernel Herding [3], we demonstrate how the methods proposed in this paper can be applied to compress the conditional distribution. We report the root mean square error $\mathrm{RMSE}(\mathcal{C}) := \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{E}[h(Y) \mid X = x_i] - \langle \hat{\mu}^{\mathcal{C}}_{Y \mid X = x_i}, h \rangle_{\mathcal{H}_l} \right)^2}$, across a range of test functions $h : \mathcal{Y} \to \mathbb{R}$. This aligns with the standard applications of the KCME [17–33], where one estimates the conditional expectation of a function of interest $h$. When the exact value of the conditional expectation is unavailable, we approximate $\mathbb{E}[h(Y) \mid X = x_i]$ via its full-data estimate, $\langle \hat{\mu}^{\mathcal{D}}_{Y \mid X = x_i}, h \rangle_{\mathcal{H}_l}$. Note that when $h$ is the identity function, the estimate reduces to the familiar regression setting, i.e. $\mathbb{E}[Y \mid X]$, and when $h$ is an indicator function, it corresponds to estimating class-conditional probabilities, e.g. $\mathbb{P}(Y = 0 \mid X)$. For full details of the experiments, including results on additional test functions omitted from the main text due to space constraints, see Section C. 5.1 Matching the True Conditional Distribution In general, the expectations in (2), (3), (8), and (10) must be estimated. However, when the kernels $k$ and $l$ are Gaussian, and we let $P_X = \mathcal{N}(\mu, \sigma^2)$ and $P_{Y \mid X} = \mathcal{N}(a_0 + a_1 X, \sigma_\epsilon^2)$ for $\mu, a_0, a_1 \in \mathbb{R}$ and $\sigma^2, \sigma_\epsilon^2 > 0$, the integrals can be evaluated analytically. See Section C.3 for details. We construct compressed sets of size $m = 500$, and compute the $\mathrm{AMCMD}^2\left[P_X, P_{Y \mid X}, P_{Y \mid X}\right]$ achieved by each method. Figures 2 and 3 highlight the advantages of directly targeting the conditional distribution, with ACKH and ACKIP achieving lower AMCMD compared to JKH and JKIP. Additionally, in the case of ACKIP versus ACKH, it demonstrates the superiority of joint optimisation over herding, where the inability to revisit previous selections limits ACKH's performance in comparison to ACKIP. Moreover, we can see that the reduced AMCMD achieved by ACKH and ACKIP translates to improved performance in estimating conditional expectations across a variety of test functions. 5.2 Matching the Empirical Conditional Distribution In this section, we present experiments targeting the empirical conditional distribution of synthetic and real data. Across all datasets, we generate or subsample down to $n = 10{,}000$ pairs, split off 10% for validation, 10% for testing, and construct compressed sets up to size $m = 250$. |
| Researcher Affiliation | Academia | Dominic Broadbent, School of Mathematics, University of Bristol, Bristol, United Kingdom; Nick Whiteley, School of Mathematics, University of Bristol, Bristol, United Kingdom; Robert Allison, School of Mathematics, University of Bristol, Bristol, United Kingdom; Tom Lovett, Mathematical Institute, University of Oxford, Oxford, United Kingdom |
| Pseudocode | Yes | D.1 Pseudocode In this section we include pseudocode for the algorithms introduced in this work, including gradient-free variants of the Kernel Herding-type algorithms suitable for $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}^p$. In all gradient-based algorithms, the pseudocode assumes standard gradient descent. In practice, however, any gradient descent variant may be used. In our implementation, we employed the Optax [82] package, which provides access to a wide range of gradient-based optimisation methods, including ADAM [81], which we used in our experiments. |
| Open Source Code | Yes | Justification: We use publicly accessible datasets, and all details required to reproduce the experiments are included in Section C. Code to reproduce the results can be found at https://github.com/conditionaldummy/dummy_repo, with scripts to run experiments and a notebook to make figures included in the supplemental material. |
| Open Datasets | Yes | Real: We use the Superconductivity dataset from UCI [44]. Superconductivity is composed of d = 81 features relating to the chemical composition of superconductors with the target being its critical temperature [45]. Real: We use MNIST [83, 84], where we subsample down to n = 10,000 due to memory limitations, splitting off 10% for validation and another 10% for testing. |
| Dataset Splits | Yes | Across all datasets, we generate or subsample down to n = 10,000 pairs, split off 10% for validation, 10% for testing, and construct compressed sets up to size m = 250. |
| Hardware Specification | Yes | All the experiments were performed on a single NVIDIA GTX 4070 Ti with 12GB of memory, CUDA 12.2 with driver 535.183.01 and JAX version 0.4.35. |
| Software Dependencies | Yes | All the experiments were performed on a single NVIDIA GTX 4070 Ti with 12GB of memory, CUDA 12.2 with driver 535.183.01 and JAX version 0.4.35. For all experiments, we use the Adam optimiser [81] as a sensible default choice. However, the implementations of ACKIP and ACKH allow for an arbitrary choice of optimiser via the Optax package [82]. |
| Experiment Setup | Yes | The regularisation parameter λ > 0 is selected using a two-stage cross-validation procedure on a validation set consisting of 10% of the data. In the first stage, a coarse grid of candidate λ values is used to identify a preliminary range for the regularisation parameter. In the second stage, a finer grid is constructed within this range, and searched over. To avoid the $O(n^3)$ cost of training the KCME, we randomly sample 10 subsets of the training data of size 1000 such that the optimal λ value is averaged over these random training sets to determine the final regularisation parameter. For all experiments, we use the Adam optimiser [81] as a sensible default choice. However, the implementations of ACKIP and ACKH allow for an arbitrary choice of optimiser via the Optax package [82]. We set a default learning rate of 0.01 across all experiments. To provide reasonable seeds for minimisation across each iteration of JKH and ACKH, we follow the approach of [3] and draw 10 random auxiliary pairs from the training data, choosing the seed to be the auxiliary sample which achieves the smallest value of the relevant loss. In comparison, to initialise JKIP and ACKIP, we draw 10 auxiliary sets of size m, and choose the initial seed set to be the auxiliary set which achieves the smallest value of the relevant loss. |
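
The two-stage selection of λ quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation (their code uses JAX), and several details are assumptions made only for illustration: a Gaussian kernel with unit lengthscale, kernel ridge regression with the regulariser scaled as nλ, validation mean squared error as the selection criterion, a fine grid spanning one decade either side of the coarse winner, and an arithmetic mean over subsets. The function names `gaussian_kernel`, `validation_mse`, `best_lambda` and `two_stage_lambda` are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, lengthscale=1.0):
    # Gram matrix between rows of a (n, d) and rows of b (m, d); assumed kernel choice.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def validation_mse(x_tr, y_tr, x_val, y_val, lam):
    # Kernel ridge fit on the training subset, scored on the validation set.
    n = x_tr.shape[0]
    gram = gaussian_kernel(x_tr, x_tr)
    alpha = np.linalg.solve(gram + n * lam * np.eye(n), y_tr)
    pred = gaussian_kernel(x_val, x_tr) @ alpha
    return float(np.mean((pred - y_val) ** 2))

def best_lambda(x_tr, y_tr, x_val, y_val, grid):
    # Return the grid value with the smallest validation error.
    losses = [validation_mse(x_tr, y_tr, x_val, y_val, lam) for lam in grid]
    return float(grid[int(np.argmin(losses))])

def two_stage_lambda(x_tr, y_tr, x_val, y_val, rng,
                     n_subsets=10, subset_size=1000, n_fine=9):
    coarse = np.logspace(-8, 0, 9)  # assumed coarse grid
    picks = []
    for _ in range(n_subsets):
        # Random training subset avoids a cubic-cost solve on the full data.
        size = min(subset_size, x_tr.shape[0])
        idx = rng.choice(x_tr.shape[0], size=size, replace=False)
        xs, ys = x_tr[idx], y_tr[idx]
        # Stage 1: coarse search to locate a preliminary range.
        lam0 = best_lambda(xs, ys, x_val, y_val, coarse)
        # Stage 2: finer grid, one decade either side of the coarse winner (assumption).
        fine = np.logspace(np.log10(lam0) - 1.0, np.log10(lam0) + 1.0, n_fine)
        picks.append(best_lambda(xs, ys, x_val, y_val, fine))
    # Average the per-subset optima to obtain the final value.
    return float(np.mean(picks))

# Toy usage on synthetic 1-D regression data (sizes reduced for speed).
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 1))
y = np.sin(2.0 * x) + 0.1 * rng.normal(size=(400, 1))
lam = two_stage_lambda(x[:320], y[:320], x[320:], y[320:], rng,
                       n_subsets=3, subset_size=100)
print(f"selected lambda: {lam:.3e}")
```

The defaults `n_subsets=10` and `subset_size=1000` mirror the figures quoted in the table; the toy call uses smaller values so that it runs quickly.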