Kernel Identification Through Transformers

Authors: Fergus Simpson, Ian Davies, Vidhi Lalchand, Alessandro Vullo, Nicolas Durrande, Carl Edward Rasmussen

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train our model using synthetic data generated from priors over a vocabulary of known kernels. By exploiting the nature of the self-attention mechanism, KITT is able to process datasets with inputs of arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong performance over a diverse collection of regression benchmarks. [...] In this section we explore KITT’s ability to predict kernels for synthetic data and standard regression benchmarks. We present four baselines to assess the performance of KITT on regression data. (See the data-generation sketch below the table.)
Researcher Affiliation | Collaboration | Fergus Simpson (Secondmind, Cambridge, UK; fergus@secondmind.ai); Ian Davies (InstaDeep, London, UK); Vidhi Lalchand (University of Cambridge, Cambridge, UK); Alessandro Vullo (Secondmind, Cambridge, UK); Nicolas Durrande (Secondmind, Cambridge, UK); Carl Rasmussen (University of Cambridge, Cambridge, UK)
Pseudocode | No | The paper describes the model architecture and inference process, and includes Figure 1 as a schematic diagram, but it does not contain any formal pseudocode blocks or sections explicitly labeled “Algorithm”.
Open Source Code | Yes | The code is available at https://github.com/frgsimpson/kitt. [...] We include code and a pretrained KITT model in the supplementary material.
Open Datasets | Yes | We train our model using synthetic data generated from priors over a vocabulary of known kernels. [...] We evaluate KITT on eight real-world UCI regression tasks, spanning a range of input sizes and input dimensions (from 4 to 14). We adopt the same benchmarking methodology as Liu et al. [19].
Dataset Splits | Yes | We adopt the same benchmarking methodology as Liu et al. [19], which includes a 90/10 train/test split, and subsampling 2,000 datapoints for those cases where the dataset exceeds this number. (See the split sketch below the table.)
Hardware Specification | Yes | The KITT network was trained on a Tesla V100 GPU for approximately eight hours, with Adam [16].
Software Dependencies | No | The paper mentions using the Adam optimizer and references GPflow implicitly, but it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the experiments.
Experiment Setup | Yes | during training we restrict ourselves to the case of 4 input dimensions and 64 input points per sample. [...] The Adam optimiser was used with an initial learning rate of 10^-4, and a decay schedule with a decay rate of 0.1 every 50,000 iterations. Due to the relatively noisy nature of the classification task, a relatively large batch size of 128 was found to be beneficial. The vocabulary included product kernels of at most two terms in addition to the primitive kernels, yielding a final vocabulary of size 34. [...] We adopt the same benchmarking methodology as Liu et al. [19], which includes a 90/10 train/test split, and subsampling 2,000 datapoints for those cases where the dataset exceeds this number. For each dataset, we predict a kernel caption with a maximum expression length of three terms. (See the optimiser sketch below the table.)
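
The "Research Type" row quotes the paper's core training scheme: KITT is trained on synthetic data drawn from GP priors over a vocabulary of known kernels (primitives plus products of at most two terms). A minimal sketch of what such a generator could look like is given below. The primitive set, hyperparameter values, and 1-D inputs are illustrative assumptions, not the released KITT code; the paper itself trains on 4 input dimensions with 64 points per sample and reports a final vocabulary of size 34 for its own primitive set.

```python
import itertools
import numpy as np

# Illustrative generator: sample a kernel expression from a small vocabulary,
# then draw one function from the corresponding GP prior. Names, primitives,
# and hyperparameters are placeholders, not those used in the KITT repository.


def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    d2 = (x1 - x2.T) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)


def matern32(x1, x2, lengthscale=1.0, variance=1.0):
    r = np.sqrt(3.0) * np.abs(x1 - x2.T) / lengthscale
    return variance * (1.0 + r) * np.exp(-r)


def linear(x1, x2, variance=1.0):
    return variance * x1 @ x2.T


PRIMITIVES = {"RBF": rbf, "Matern32": matern32, "Linear": linear}

# Vocabulary = primitives plus products of at most two primitives.
VOCABULARY = list(PRIMITIVES) + [
    "*".join(pair)
    for pair in itertools.combinations_with_replacement(PRIMITIVES, 2)
]


def sample_training_example(rng, n_points=64):
    """One (X, y, label) triple; 1-D inputs here for brevity."""
    label = str(rng.choice(VOCABULARY))
    x = rng.uniform(-1.0, 1.0, size=(n_points, 1))
    k = np.ones((n_points, n_points))
    for name in label.split("*"):
        k *= PRIMITIVES[name](x, x)
    y = rng.multivariate_normal(np.zeros(n_points), k + 1e-6 * np.eye(n_points))
    return x, y, label


rng = np.random.default_rng(0)
X, y, kernel_name = sample_training_example(rng)
print(kernel_name, X.shape, y.shape)
```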
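
The "Dataset Splits" row states the benchmarking protocol adopted from Liu et al. [19]: subsample 2,000 datapoints when a dataset exceeds that size, then apply a 90/10 train/test split. A small sketch under those stated assumptions follows; the random subsampling, function name, and arguments are ours rather than taken from the repository.

```python
import numpy as np


def uci_split(X, y, seed=0, max_points=2_000, test_fraction=0.1):
    """Subsample to at most 2,000 points, then take a 90/10 train/test split."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Subsample datasets that exceed the cap (assumed to be uniform at random).
    if n > max_points:
        keep = rng.choice(n, size=max_points, replace=False)
        X, y = X[keep], y[keep]
        n = max_points

    # Random 90/10 train/test split.
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```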
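
The "Experiment Setup" row specifies Adam with an initial learning rate of 10^-4, a decay rate of 0.1 every 50,000 iterations, and a batch size of 128. Below is a hedged reconstruction of that optimiser configuration, assuming a TensorFlow/Keras implementation (the released code builds on GPflow, which runs on TensorFlow); the schedule object and the staircase choice are our assumptions, not details confirmed by the paper.

```python
import tensorflow as tf

# Stated settings: Adam, initial learning rate 1e-4, decay rate 0.1 every
# 50,000 iterations, batch size 128.
BATCH_SIZE = 128

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=50_000,
    decay_rate=0.1,
    staircase=True,  # assumed: drop the rate in steps rather than continuously
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```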