Simple, Distributed, and Accelerated Probabilistic Programming
Authors: Dustin Tran, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate three applications: a model-parallel variational auto-encoder (VAE) [24] with TPUs; a data-parallel autoregressive model (Image Transformer [31]) with TPUs; and multi-GPU No-U-Turn Sampler (NUTS) [21]. For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan [8] and 37x over PyMC3 [39]. |
| Researcher Affiliation | Industry | Dustin Tran, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous (Google Brain, Google) |
| Pseudocode | Yes | Figure 5: Minimal implementation of tracing; Figure 10: Core logic in No-U-Turn Sampler [21]. (A hedged tracing sketch follows the table.) |
| Open Source Code | Yes | All code, including experiments and more details from code snippets displayed here, is available at http://bit.ly/2JpFipt. |
| Open Datasets | Yes | For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan [8] and 37x over PyMC3 [39]. |
| Dataset Splits | No | The paper mentions using 64x64 ImageNet, 256x256 CelebA-HQ, and the Covertype dataset, but it does not provide explicit training/validation/test splits (percentages or counts) or refer to standard predefined splits for reproduction. |
| Hardware Specification | Yes | CPU experiments use a six-core Intel E5-1650 v4, GPU experiments use 1-8 NVIDIA Tesla V100 GPUs, and TPU experiments use 2nd generation chips under a variety of topology arrangements. The TPUv2 chip comprises two cores: each features roughly 22 teraflops on mixed 16/32-bit precision (it is roughly twice the flops of a NVIDIA Tesla P100 GPU on 32-bit precision). |
| Software Dependencies | Yes | Code snippets assume tensorflow==1.12.0. |
| Experiment Setup | Yes | In all distributed experiments, we cross-shard the optimizer for data-parallelism: each shard (core) takes a batch size of 1. For 256x256 CelebA-HQ, we use a relatively small Image Transformer [31] in order to fit the model in memory. It applies 5 layers of local 1D self-attention with block length of 256, hidden sizes of 128, attention key/value channels of 64, and feedforward layers with a hidden size of 256. (A hedged cross-shard optimizer sketch follows the table.) |
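
The Pseudocode row points to Figure 5's minimal implementation of tracing. The sketch below is not the paper's code; it is a hedged reconstruction of what such an effect-handler-style tracer can look like, assuming a thread-local stack of tracer functions, a `trace` context manager, and a `traceable` decorator (all names here are illustrative).

```python
# Hedged sketch of a minimal tracing mechanism (not the paper's Figure 5 verbatim).
import contextlib
import functools
import threading

_local = threading.local()

def _stack():
    # Lazily initialize a per-thread tracer stack; the default tracer just
    # calls the wrapped function unchanged.
    if not hasattr(_local, "stack"):
        _local.stack = [lambda f, *args, **kwargs: f(*args, **kwargs)]
    return _local.stack

@contextlib.contextmanager
def trace(tracer):
    """Pushes `tracer` so it intercepts all traceable calls inside the block."""
    _stack().append(tracer)
    try:
        yield
    finally:
        _stack().pop()

def traceable(func):
    """Wraps `func` so the innermost active tracer decides how to execute it."""
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        return _stack()[-1](func, *args, **kwargs)
    return wrapped

# Illustrative usage: log every traceable call while it runs.
@traceable
def normal(loc, scale):
    return loc  # stand-in for drawing a random sample

def logging_tracer(f, *args, **kwargs):
    print("tracing", f.__name__, args, kwargs)
    return f(*args, **kwargs)

with trace(logging_tracer):
    normal(0.0, 1.0)
```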
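
The Experiment Setup row mentions cross-sharding the optimizer for data parallelism with a per-core batch size of 1. Under the stated `tensorflow==1.12.0` dependency, this kind of data parallelism is typically wired up with `tf.contrib.tpu.CrossShardOptimizer` inside a `TPUEstimator` model function. The sketch below is a placeholder illustration, not the paper's configuration: the model, hyperparameters, and batch sizes are assumptions.

```python
import tensorflow as tf  # assumes tensorflow==1.12.0, as stated above

def model_fn(features, labels, mode, params):
    # Placeholder model; the paper's VAE / Image Transformer would go here.
    logits = tf.layers.dense(features, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    optimizer = tf.train.AdamOptimizer(learning_rate=params["learning_rate"])
    # Cross-shard the optimizer so each TPU core computes gradients on its own
    # (here size-1) batch and the gradients are averaged across shards.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# Wiring into TPUEstimator; an actual run also needs a TPU cluster/master.
estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=tf.contrib.tpu.RunConfig(
        tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100)),
    params={"learning_rate": 1e-3},
    train_batch_size=8,  # global batch; with 8 shards each core sees a batch of 1
    use_tpu=True)
```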