Simple, Distributed, and Accelerated Probabilistic Programming

Authors: Dustin Tran, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate three applications: a model-parallel variational auto-encoder (VAE) [24] with TPUs; a data-parallel autoregressive model (Image Transformer [31]) with TPUs; and multi-GPU No-U-Turn Sampler (NUTS) [21]. For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan [8] and 37x over PyMC3 [39].
Researcher Affiliation | Industry | Dustin Tran, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous (Google Brain, Google)
Pseudocode | Yes | Figure 5: Minimal implementation of tracing; and Figure 10: Core logic in No-U-Turn Sampler [21]. (An illustrative tracing sketch in this spirit appears below the table.)
Open Source Code | Yes | All code, including experiments and more details from code snippets displayed here, is available at http://bit.ly/2JpFipt.
Open Datasets | Yes | For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan [8] and 37x over PyMC3 [39].
Dataset Splits | No | The paper mentions using 64x64 ImageNet, 256x256 CelebA-HQ, and the Covertype dataset, but it does not explicitly provide specific training/validation/test splits (percentages or counts) or refer to standard predefined splits for reproduction.
Hardware Specification | Yes | CPU experiments use a six-core Intel E5-1650 v4, GPU experiments use 1-8 NVIDIA Tesla V100 GPUs, and TPU experiments use 2nd-generation chips under a variety of topology arrangements. The TPUv2 chip comprises two cores; each features roughly 22 teraflops at mixed 16/32-bit precision (roughly twice the flops of an NVIDIA Tesla P100 GPU at 32-bit precision).
Software Dependencies | Yes | Code snippets assume tensorflow==1.12.0. (A version-check sketch appears below the table.)
Experiment Setup | Yes | In all distributed experiments, we cross-shard the optimizer for data-parallelism: each shard (core) takes a batch size of 1. For 256x256 CelebA-HQ, we use a relatively small Image Transformer [31] in order to fit the model in memory. It applies 5 layers of local 1D self-attention with a block length of 256, hidden sizes of 128, attention key/value channels of 64, and feedforward layers with a hidden size of 256. (These hyperparameters are collected into a config sketch below the table.)
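
The Pseudocode row points to the paper's Figure 5, a minimal implementation of tracing. The sketch below is an illustration of the same idea rather than the paper's exact code: a stack of tracer functions plus a `traceable` decorator, so that whichever tracer is active can intercept calls such as random-variable constructors. The names `_TRACER_STACK`, `traceable`, `normal`, and `log_calls` are assumptions made for this sketch.

```python
import contextlib

# Stack of active tracers; the default tracer simply calls the function.
_TRACER_STACK = [lambda fn, *args, **kwargs: fn(*args, **kwargs)]

@contextlib.contextmanager
def trace(tracer):
  """Pushes `tracer` so it intercepts all traceable calls in the `with` block."""
  _TRACER_STACK.append(tracer)
  try:
    yield
  finally:
    _TRACER_STACK.pop()

def traceable(fn):
  """Wraps `fn` so the innermost active tracer sees every call to it."""
  def wrapped(*args, **kwargs):
    return _TRACER_STACK[-1](fn, *args, **kwargs)
  return wrapped

# Toy "random variable" constructor whose calls can be intercepted, e.g. to
# condition a program on observed values or to record an execution trace.
@traceable
def normal(loc, scale, name=None):
  return {"distribution": "Normal", "loc": loc, "scale": scale, "name": name}

def log_calls(fn, *args, **kwargs):
  # Example tracer: logs every traceable call before delegating to it.
  print("traced call:", fn.__name__, args, kwargs)
  return fn(*args, **kwargs)

if __name__ == "__main__":
  with trace(log_calls):
    normal(0., 1., name="z")  # this call is routed through `log_calls`
```

In the paper's design, interceptors built on this kind of tracing are what let inference code (such as the NUTS core logic in Figure 10) manipulate model programs without rewriting them.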
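The Software Dependencies row pins tensorflow==1.12.0 for the code snippets. A reproduction script might fail fast on a mismatched installation; the check below is an illustrative convention, not something taken from the paper.

```python
import tensorflow as tf

# The paper's code snippets assume tensorflow==1.12.0; fail fast otherwise.
EXPECTED_TF_VERSION = "1.12.0"
if tf.__version__ != EXPECTED_TF_VERSION:
    raise RuntimeError(
        "Code snippets assume tensorflow==%s, but found %s"
        % (EXPECTED_TF_VERSION, tf.__version__))
```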
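For reference, the Image Transformer hyperparameters quoted in the Experiment Setup row are gathered into a single Python dictionary below. The field names are illustrative and are not the authors' actual configuration schema.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into one place.
# Field names are illustrative, not the authors' configuration schema.
IMAGE_TRANSFORMER_CELEBAHQ_256 = {
    "num_layers": 5,                      # 5 layers of local 1D self-attention
    "attention_type": "local_1d",
    "block_length": 256,                  # local self-attention block length
    "hidden_size": 128,
    "attention_key_channels": 64,
    "attention_value_channels": 64,
    "ffn_hidden_size": 256,               # feedforward layers' hidden size
    "batch_size_per_shard": 1,            # each shard (TPU core) takes batch size 1
    "optimizer_sharding": "cross_shard",  # data-parallel cross-sharded optimizer
}
```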