Transformers Can Do Bayesian Inference
Authors: Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, Frank Hutter
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. In our first set of experiments, we study the capability of PFNs to perform Bayesian inference for the tractable case of Gaussian Processes (GPs) with fixed hyperparameters (where we can compare to ground truth data; Section 5.1) and the intractable cases of GPs with unknown hyperparameters (Section 5.2) and Bayesian Neural Networks (BNNs; Section 5.3). |
| Researcher Affiliation | Collaboration | Samuel Müller¹, Noah Hollmann², Sebastian Pineda¹, Josif Grabocka¹, Frank Hutter¹,³; ¹University of Freiburg, ²Charité Berlin, ³Bosch Center for Artificial Intelligence |
| Pseudocode | Yes | Algorithm 1: Training a PFN model by Fitting Prior-Data (a minimal sketch of this training loop appears after the table). |
| Open Source Code | Yes | Code and trained PFNs are released at https://github.com/automl/TransformersCanDoBayesianInference. |
| Open Datasets | Yes | We used a large collection of tabular datasets from the open-source OpenML AutoML Benchmark (Gijsbers et al., 2019); we first removed datasets with more than one hundred features or missing values, ending up with 20 datasets that represent a diverse set of classification problems with numerical and categorical features. |
| Dataset Splits | Yes | We also define a set of six unrelated validation datasets used for optimizing the prior distribution over architectures of the PFNs. This is similar to setting the range of hyperparameters in a cross-validation grid search and can be reused for all similar problems. See Appendix G for more details. We used grid search with 5-fold cross-validation to optimize our baselines' hyperparameters for each dataset, using the hyperparameter spaces described in Table 6 in the appendix. For each dataset, we sampled 20 subsets, each including 100 samples. Within each subset we provide labels for the first 30 samples and evaluate on the remaining samples. (A sketch of this subsampling protocol appears after the table.) |
| Hardware Specification | Yes | when run on a GPU (Nvidia Tesla V100), it requires as little as 13 seconds for all 20 datasets combined. |
| Software Dependencies | No | The paper mentions “PyTorch (Paszke et al., 2019)” and “Pyro (Bingham et al., 2018)” but does not specify their version numbers for reproducibility. |
| Experiment Setup | Yes | For all experiments we used an embedding size of 512; only for few-shot classification did we use 1024. The only hyperparameters that we fine-tuned for the Transformer training were the batch size and the learning rate. We used a learning rate of 1e-5 to yield the performance shown in the plot after sampling 500,000 samples from the training tasks. Table 5: Hyperparameters considered during grid search tuning of the PFN-BNN on validation datasets. |
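The pseudocode row above refers to Algorithm 1, which fits a PFN by repeatedly sampling whole datasets from the prior and training the transformer to predict held-out labels. Below is a minimal sketch of that loop, not the authors' released implementation: `model` and `sample_prior_dataset` are assumed stand-in callables, the 30/70 context/query split mirrors the evaluation protocol quoted above, and the paper's discretized (Riemann) output head is reduced here to plain cross-entropy over class logits.

```python
import torch
from torch import nn

def train_pfn(model, sample_prior_dataset, optimizer,
              steps=1000, n_train=30, n_total=100):
    """Sketch of Algorithm 1 (Fitting Prior-Data): train on datasets drawn
    from the prior p(D). `model` is assumed to map a labeled context set plus
    query inputs to per-query logits; `sample_prior_dataset` is assumed to
    return tensors x of shape (n_total, d) and integer labels y of shape
    (n_total,). Both are hypothetical stand-ins, not the authors' API."""
    loss_fn = nn.CrossEntropyLoss()  # NLL of held-out labels under the PPD
    for _ in range(steps):
        x, y = sample_prior_dataset(n_total)     # one synthetic dataset from the prior
        ctx_x, ctx_y = x[:n_train], y[:n_train]  # labeled context ("training") part
        qry_x, qry_y = x[n_train:], y[n_train:]  # held-out part to predict
        logits = model(ctx_x, ctx_y, qry_x)      # approximation of q(y | x, D)
        loss = loss_fn(logits, qry_y)            # prior-data negative log-likelihood
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

Note the design point this makes concrete: the transformer never sees a real dataset during training; every gradient step uses a fresh dataset sampled from the prior, which is what lets a single forward pass at test time act as approximate Bayesian inference.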
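The dataset-splits row quotes an evaluation protocol of 20 random subsets of 100 samples per dataset, with labels provided for the first 30 samples. A small sketch of how such splits could be drawn, assuming NumPy arrays and a seeding scheme the paper does not specify (the function name is our own):

```python
import numpy as np

def make_eval_subsets(X, y, n_subsets=20, subset_size=100, n_labeled=30, seed=0):
    """Hypothetical reconstruction of the quoted protocol: for each dataset,
    draw 20 random subsets of 100 samples; within each subset the first 30
    serve as labeled context and the remaining 70 are used for evaluation."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=subset_size, replace=False)
        labeled, held_out = idx[:n_labeled], idx[n_labeled:]
        splits.append(((X[labeled], y[labeled]), (X[held_out], y[held_out])))
    return splits
```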