The Inductive Bias of Quantum Kernels

Authors: Jonas Kübler, Simon Buchholz, Bernhard Schölkopf

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Since for small d we can simulate the biased kernel efficiently, we illustrate our theoretical findings in the following experiments. Our implementation, building on standard open source packages [50, 51], is available online (footnote 4). We consider the case described above where we know that the data was generated by measuring an observable on the first qubit, i.e., f(x) = Tr[ρ_V^1(x) M], but we do not know M, see Fig. 1. We use the full kernel k and the biased kernel q for the case m = 1. To show the effect of selecting the wrong bias, we also include the behavior of a biased kernel defined only on the second qubit, denoted as q_w. As a classical reference we also include the performance of a radial basis function kernel k_rbf(x, x') = exp(-‖x - x'‖^2 / 2). For the experiments we fix a single-qubit observable M = σ_z and perform the experiment for a varying number d of qubits. First we draw a random unitary V. The dataset is then generated by drawing N = 200 realizations {x^(i)}_{i=1}^N from the d-dimensional uniform distribution on [0, 2π]^d. We then define the labels as y^(i) = c f(x^(i)) + ϵ^(i), where f(x) = Tr[ρ_V^1(x) σ_z], ϵ^(i) is Gaussian noise with Var[ϵ] = 10^-4, and c is chosen such that Var[f(X)] = 1. Keeping the variances fixed ensures that we can interpret the behavior for varying d. We first verify our findings from Theorem 2b) and Equation (11) by estimating the spectrum of q. Fig. 2 (left) shows that Theorem 2b) also holds for individual V with high probability. We then use 2/3 of the data for training kernel ridge regression (we fit the mean separately) with preset regularization, and use 1/3 to estimate the test error. We average the results over ten random seeds (random V, x^(i), ϵ^(i)) and report the results in the right panel of Fig. 2. This showcases that as the number of qubits increases, it is impossible to learn f without the appropriate spectral bias. k and k_rbf have too little bias and overfit, whereas q_w has the wrong bias and severely underfits. The performance of q_w underlines that biasing the kernel arbitrarily does not significantly improve performance over the full kernel k. In the appendix we show that this is not due to a bad choice of regularization, by reporting results over a range of regularizations. To further illustrate the spectral properties, we empirically estimate the kernel target alignment [30] and the task-model alignment [32] that we introduced in Sec. 2. By using the centered kernel matrix (see App. B) and centering the data, we can ignore the first eigenvalue in (11) corresponding to the constant function. In Figure 3 (left) we show the empirical (centered) kernel target alignment for 50 random seeds. The biased kernel is the only one well aligned with the task. The right panel of Fig. 3 shows the task-model alignment. This shows that f can be completely expressed with the first four components of the biased kernel, while the other kernels essentially need the entire spectrum (we use a sample size of 200, hence the empirical kernel matrix is only 200-dimensional) and thus are unable to learn. Note that the kernel q_w is four-dimensional, so the higher contributions correspond to functions outside its RKHS, which it cannot learn at all. (A data-generation sketch for this setup is given after the table.)
Researcher Affiliation | Academia | Jonas M. Kübler, Simon Buchholz, Bernhard Schölkopf, Max Planck Institute for Intelligent Systems, Tübingen, Germany, {jmkuebler, sbuchholz, bs}@tue.mpg.de
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation, building on standard open source packages [50, 51], is available online. ... Footnote 4: https://github.com/jmkuebler/quantumbias
Open Datasets | No | The dataset is then generated by drawing N = 200 realizations {x^(i)}_{i=1}^N from the d-dimensional uniform distribution on [0, 2π]^d. This is a synthetic dataset generated by the authors, and no public access link or citation is provided for it.
Dataset Splits | No | We then use 2/3 of the data for training kernel ridge regression (we fit the mean separately) with preset regularization, and use 1/3 to estimate the test error. The paper specifies a train/test split, but no explicit validation split. (A sketch of this split-and-evaluate protocol is given after the table.)
Hardware Specification | No | The paper mentions simulating the biased kernel efficiently but does not provide any specific details about the hardware used for these simulations (e.g., CPU, GPU models, or memory).
Software Dependencies | No | Our implementation, building on standard open source packages [50, 51], is available online. References [50] and [51] refer to Scikit-learn and PennyLane, respectively. However, no specific version numbers are provided for these software packages.
Experiment Setup | Yes | We consider the case described above where we know that the data was generated by measuring an observable on the first qubit, i.e., f(x) = Tr[ρ_V^1(x) M], but we do not know M, see Fig. 1. We use the full kernel k and the biased kernel q for the case m = 1. To show the effect of selecting the wrong bias, we also include the behavior of a biased kernel defined only on the second qubit, denoted as q_w. As a classical reference we also include the performance of a radial basis function kernel k_rbf(x, x') = exp(-‖x - x'‖^2 / 2). For the experiments we fix a single-qubit observable M = σ_z and perform the experiment for a varying number d of qubits. First we draw a random unitary V. The dataset is then generated by drawing N = 200 realizations {x^(i)}_{i=1}^N from the d-dimensional uniform distribution on [0, 2π]^d. We then define the labels as y^(i) = c f(x^(i)) + ϵ^(i), where f(x) = Tr[ρ_V^1(x) σ_z], ϵ^(i) is Gaussian noise with Var[ϵ] = 10^-4, and c is chosen such that Var[f(X)] = 1. Keeping the variances fixed ensures that we can interpret the behavior for varying d. We first verify our findings from Theorem 2b) and Equation (11) by estimating the spectrum of q. Fig. 2 (left) shows that Theorem 2b) also holds for individual V with high probability. We then use 2/3 of the data for training kernel ridge regression (we fit the mean separately) with preset regularization, and use 1/3 to estimate the test error. We average the results over ten random seeds (random V, x^(i), ϵ^(i)) and report the results in the right panel of Fig. 2. (Sketches of the kernels and the centered alignment measure are given after the table.)
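
To make the data-generating process quoted in the Research Type and Experiment Setup rows concrete, here is a minimal NumPy sketch. It assumes a simple product encoding of x via single-qubit RX(x_j) rotations and a Haar-random global unitary V; the paper's exact feature map may differ, and all variable names (encode, reduced_first_qubit, ...) are illustrative rather than taken from the authors' code.

```python
# Minimal sketch of the data generation described above (assumed encoding).
import numpy as np
from scipy.stats import unitary_group

n_qubits, n_samples = 4, 200          # d and N from the quoted setup
rng = np.random.default_rng(0)
sigma_z = np.diag([1.0, -1.0])

def encode(x):
    """Product state from single-qubit RX(x_j) rotations applied to |0> (assumption)."""
    psi = np.array([1.0 + 0.0j])
    for xj in x:
        psi = np.kron(psi, np.array([np.cos(xj / 2), -1j * np.sin(xj / 2)]))
    return psi

V = unitary_group.rvs(2 ** n_qubits, random_state=0)   # random global unitary V

def reduced_first_qubit(psi):
    """Reduced density matrix of the first qubit of V|psi(x)>."""
    amp = (V @ psi).reshape(2, -1)     # split first qubit vs. the remaining qubits
    return amp @ amp.conj().T          # partial trace over the rest

X = rng.uniform(0.0, 2 * np.pi, size=(n_samples, n_qubits))
rho1 = np.stack([reduced_first_qubit(encode(x)) for x in X])
f_vals = np.einsum("nij,ji->n", rho1, sigma_z).real      # f(x) = Tr[rho_V^1(x) sigma_z]
c = 1.0 / f_vals.std()                                   # rescale so Var[f(X)] is roughly 1
y = c * f_vals + rng.normal(scale=1e-2, size=n_samples)  # Gaussian noise with Var[eps] = 1e-4
```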
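The evaluation protocol in the Dataset Splits and Experiment Setup rows (2/3 train, 1/3 test, kernel ridge regression on a precomputed Gram matrix, with the mean fitted separately) could look roughly as below. The regularization strength alpha is a placeholder, not the paper's preset value, and fit_and_score is our own helper name.

```python
# Sketch of the 2/3-1/3 split and kernel ridge regression evaluation.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

def fit_and_score(K, y, alpha=1e-3, seed=0):
    """Fit KRR on a precomputed Gram matrix K and return the test mean squared error."""
    idx_tr, idx_te = train_test_split(np.arange(len(y)), test_size=1 / 3, random_state=seed)
    y_mean = y[idx_tr].mean()                           # fit the mean separately
    model = KernelRidge(alpha=alpha, kernel="precomputed")
    model.fit(K[np.ix_(idx_tr, idx_tr)], y[idx_tr] - y_mean)
    pred = model.predict(K[np.ix_(idx_te, idx_tr)]) + y_mean
    return np.mean((pred - y[idx_te]) ** 2)

# Averaging over seeds, as in the quoted setup (ten seeds for V, x, and the noise),
# would wrap the data generation and this call in a loop over seed values.
```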
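Finally, a sketch of the Gram matrices and the centered kernel-target alignment compared in the quoted passage. It reuses encode, rho1, X, and y from the first sketch; reading the full kernel as the pure-state fidelity and the biased kernel as the overlap of reduced first-qubit states is our interpretation of the definitions, not code taken from the paper.

```python
# Sketch of the full kernel k, the biased kernel q, and the centered alignment.
import numpy as np

def full_kernel(states):
    """k(x, x') = |<psi(x)|psi(x')>|^2 for stacked pure-state embeddings
    (the global unitary V cancels in this overlap)."""
    overlaps = states.conj() @ states.T
    return np.abs(overlaps) ** 2

def biased_kernel(rho1):
    """q(x, x') = Tr[rho^1(x) rho^1(x')] on the first qubit (assumed form)."""
    flat = rho1.reshape(len(rho1), -1)
    return (flat @ flat.conj().T).real

def centered_alignment(K, y):
    """Centered kernel-target alignment between Gram matrix K and labels y."""
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc = H @ K @ H
    T = np.outer(y - y.mean(), y - y.mean())
    return np.sum(Kc * T) / (np.linalg.norm(Kc) * np.linalg.norm(T))

# Example use with the first sketch: states = np.stack([encode(x) for x in X]);
# centered_alignment(biased_kernel(rho1), y) should exceed
# centered_alignment(full_kernel(states), y) when the bias matches the target.
```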