Simple, Distributed, and Accelerated Probabilistic Programming
Authors: Dustin Tran, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate three applications: a model-parallel variational auto-encoder (VAE) [24] with TPUs; a data-parallel autoregressive model (Image Transformer [31]) with TPUs; and multi-GPU No-U-Turn Sampler (NUTS) [21]. For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan [8] and 37x over PyMC3 [39]. |
| Researcher Affiliation | Industry | Dustin Tran, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous (Google Brain, Google) |
| Pseudocode | Yes | Figure 5: Minimal implementation of tracing; Figure 10: Core logic in No-U-Turn Sampler [21]. (A hedged tracing sketch follows the table.) |
| Open Source Code | Yes | All code, including experiments and more details from code snippets displayed here, is available at http://bit.ly/2JpFipt. |
| Open Datasets | Yes | For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan [8] and 37x over PyMC3 [39]. |
| Dataset Splits | No | The paper mentions using 64x64 ImageNet, 256x256 CelebA-HQ, and the Covertype dataset, but it does not provide explicit training/validation/test splits (percentages or counts) or refer to standard predefined splits for reproduction. |
| Hardware Specification | Yes | CPU experiments use a six-core Intel E5-1650 v4, GPU experiments use 1-8 NVIDIA Tesla V100 GPUs, and TPU experiments use 2nd generation chips under a variety of topology arrangements. The TPUv2 chip comprises two cores: each features roughly 22 teraflops on mixed 16/32-bit precision (it is roughly twice the flops of a NVIDIA Tesla P100 GPU on 32-bit precision). |
| Software Dependencies | Yes | Code snippets assume tensorflow==1.12.0. |
| Experiment Setup | Yes | In all distributed experiments, we cross-shard the optimizer for data-parallelism: each shard (core) takes a batch size of 1. For 256x256 CelebA-HQ, we use a relatively small Image Transformer [31] in order to fit the model in memory. It applies 5 layers of local 1D self-attention with block length of 256, hidden sizes of 128, attention key/value channels of 64, and feedforward layers with a hidden size of 256. (A hedged cross-shard optimizer sketch follows the table.) |
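
The Pseudocode row points to Figure 5's minimal implementation of tracing. The sketch below is not the paper's code; it is a hedged reconstruction of what such an effect-handler-style tracer can look like, assuming a thread-local stack of tracer functions, a `trace` context manager, and a `traceable` decorator (all names here are illustrative).

```python
# Hedged sketch of a minimal tracing mechanism (not the paper's Figure 5 verbatim).
import contextlib
import functools
import threading

_local = threading.local()

def _stack():
    # Lazily initialize a per-thread tracer stack; the default tracer just
    # calls the wrapped function unchanged.
    if not hasattr(_local, "stack"):
        _local.stack = [lambda f, *args, **kwargs: f(*args, **kwargs)]
    return _local.stack

@contextlib.contextmanager
def trace(tracer):
    """Pushes `tracer` so it intercepts all traceable calls inside the block."""
    _stack().append(tracer)
    try:
        yield
    finally:
        _stack().pop()

def traceable(func):
    """Wraps `func` so the innermost active tracer decides how to execute it."""
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        return _stack()[-1](func, *args, **kwargs)
    return wrapped

# Illustrative usage: log every traceable call while it runs.
@traceable
def normal(loc, scale):
    return loc  # stand-in for drawing a random sample

def logging_tracer(f, *args, **kwargs):
    print("tracing", f.__name__, args, kwargs)
    return f(*args, **kwargs)

with trace(logging_tracer):
    normal(0.0, 1.0)
```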
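
The Experiment Setup row mentions cross-sharding the optimizer for data parallelism with a per-core batch size of 1. Under the stated `tensorflow==1.12.0` dependency, this kind of data parallelism is typically wired up with `tf.contrib.tpu.CrossShardOptimizer` inside a `TPUEstimator` model function. The sketch below is a placeholder illustration, not the paper's configuration: the model, hyperparameters, and batch sizes are assumptions.

```python
import tensorflow as tf  # assumes tensorflow==1.12.0, as stated above

def model_fn(features, labels, mode, params):
    # Placeholder model; the paper's VAE / Image Transformer would go here.
    logits = tf.layers.dense(features, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    optimizer = tf.train.AdamOptimizer(learning_rate=params["learning_rate"])
    # Cross-shard the optimizer so each TPU core computes gradients on its own
    # (here size-1) batch and the gradients are averaged across shards.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# Wiring into TPUEstimator; an actual run also needs a TPU cluster/master.
estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=tf.contrib.tpu.RunConfig(
        tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100)),
    params={"learning_rate": 1e-3},
    train_batch_size=8,  # global batch; with 8 shards each core sees a batch of 1
    use_tpu=True)
```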