Memory Safe Computations with XLA Compiler

Authors: Artem Artemev, Yuze An, Tilman Roeder, Mark van der Wilk

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that k-nearest neighbour, sparse Gaussian process regression methods and Transformers can be run on a single device at a much larger scale, where standard implementations would have failed. Our approach leads to better use of hardware resources.
Researcher Affiliation | Collaboration | Artem Artemev (Imperial College London, Secondmind) a.artemev20@imperial.ac.uk; Yuze An (Imperial College London) yuze.an21@imperial.ac.uk; Tilman Roeder (Imperial College London) tilman.roeder17@imperial.ac.uk; Mark van der Wilk (Imperial College London) m.vdwilk@imperial.ac.uk
Pseudocode | Yes | Algorithm 1: High-level description of the depth-first search visitor-handler that splits the data-flow graph up to the reduction dot operation. (An illustrative sketch of this traversal is given after the table.)
Open Source Code | Yes | The code is available at https://github.com/awav/tensorflow.
Open Datasets | Yes | We use randomly generated data, common benchmarks like MNIST and Fashion-MNIST, and Glove-50, Glove-100 and Glove-200 from the ANN-benchmark toolkit (Aumüller et al., 2020). We conduct experiments on a Tesla V100 GPU with 32 GB of memory, and run on two of the largest UCI datasets commonly considered in Gaussian process research: 3droad and houseelectric, with 434,874 and 2,049,280 data points respectively.
Dataset Splits | No | The paper does not provide explicit percentages or counts for training/validation/test dataset splits, nor does it refer to specific predefined splits with citations for all three subsets.
Hardware Specification | Yes | We evaluated the expression in double precision on a Tesla V100 GPU with 32 GB of memory, and applied a range of memory limits. We conduct experiments on a Tesla V100 GPU with 32 GB of memory. We run experiments on a single Nvidia V100 GPU with 32 GB of memory.
Software Dependencies | Yes | We demonstrate the utility of eXLA by scaling the GPflow (Matthews et al., 2017; release version 2.3.1) implementation of sparse Gaussian process regression (SGPR; Titsias, 2009) without any modifications to the code. (A minimal GPflow usage sketch follows the table.)
Experiment Setup | Yes | In all benchmarks, we set the tensor size threshold for eXLA to 100 MB for simplicity, even though this may not be optimal for performance. We set the tensor size threshold and the tensor split size in eXLA to 1 GB. The Transformer model compiled with eXLA optimisations managed to run with sequences up to 7000 with the tensor limit set to 10 GB and the tensor split size set to 1 GB. (A sketch of how such limits are typically passed to the compiler follows the table.)
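
The Pseudocode row above quotes Algorithm 1, a depth-first visitor over the XLA data-flow graph that identifies large reduction dot (matrix-multiply) operations to split. Below is a minimal sketch of that idea only; the Node class, field names, and splitting criterion are assumptions made here for illustration, not the authors' C++ compiler pass.

```python
# Sketch of a depth-first visitor that collects oversized "dot" nodes in a toy
# data-flow graph. Everything here (Node, out_bytes, the threshold test) is a
# simplified stand-in for the HLO pass described in the paper.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    op: str                                   # e.g. "dot", "add", "parameter"
    out_bytes: int                            # size of the node's output tensor
    operands: List["Node"] = field(default_factory=list)


def find_splittable_dots(root: Node, threshold: int) -> List[Node]:
    """Depth-first walk that returns reduction 'dot' nodes whose output exceeds
    the tensor-size threshold and would therefore be split into chunks."""
    to_split, visited, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        if node.op == "dot" and node.out_bytes > threshold:
            to_split.append(node)
        stack.extend(node.operands)
    return to_split


# Toy usage with a 100 MB threshold: one small matmul and one oversized one.
a = Node("parameter", 8 * 10**6)
b = Node("parameter", 8 * 10**6)
small = Node("dot", 50 * 2**20, [a, b])
big = Node("dot", 400 * 2**20, [small, b])
print([n.op for n in find_splittable_dots(big, threshold=100 * 2**20)])  # ['dot']
```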
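The Software Dependencies row states that the GPflow 2.3.1 implementation of SGPR is scaled without code changes. The sketch below shows what such an unmodified GPflow model looks like; the data shapes, kernel choice, and number of inducing points are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal GPflow 2.x SGPR model of the kind the paper scales with eXLA.
import numpy as np
import gpflow

X = np.random.randn(10_000, 8)                        # stand-in for e.g. 3droad features
Y = np.random.randn(10_000, 1)
Z = X[np.random.choice(len(X), 512, replace=False)]   # inducing point locations

model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)

# Standard GPflow training loop; under eXLA the compiled graph is rewritten so
# that large intermediates respect the configured tensor-size threshold.
opt = gpflow.optimizers.Scipy()
opt.minimize(model.training_loss, model.trainable_variables, options={"maxiter": 100})
print("ELBO:", model.elbo().numpy())
```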
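The Experiment Setup row refers to the tensor size threshold and tensor split size used by eXLA. XLA options of this kind are usually supplied through the XLA_FLAGS environment variable before TensorFlow is imported; the flag names below are assumptions for illustration only, and the exact spelling is defined by the eXLA fork at https://github.com/awav/tensorflow.

```python
import os

# Hypothetical flag names, shown only to illustrate how a tensor-size threshold
# and split size might be configured; check the eXLA fork for the exact flags.
os.environ["XLA_FLAGS"] = " ".join([
    "--xla_tensor_size_threshold=100MB",  # rewrite ops whose outputs exceed this
    "--xla_tensor_split_size=100MB",      # target chunk size after splitting
])

import tensorflow as tf  # imported after XLA_FLAGS so the flags take effect


# A memory-hungry computation of the kind the paper targets: the broadcasted
# difference materialises an (N, M, D) intermediate before the reduction.
@tf.function(jit_compile=True)
def pairwise_sq_dists(a, b):
    return tf.reduce_sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)


x = tf.random.normal([2048, 64])
d = pairwise_sq_dists(x, x)  # with a compiler that splits large tensors, the
                             # (2048, 2048, 64) intermediate need not fit at once
```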