The streaming rollout of deep networks - towards fully model-parallel execution

Authors: Volker Fischer, Jan Koehler, Thomas Pfeil

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we present a theoretical framework to describe rollouts, the level of model-parallelization they induce, and demonstrate differences in solving specific tasks. We prove that certain rollouts, also for networks with only skip and no recurrent connections, enable earlier and more frequent responses, and show empirically that these early responses have better performance. The streaming rollout maximizes these properties and enables a fully parallel execution of the network, reducing runtime on massively parallel devices. Finally, we provide an open-source toolbox to design, train, evaluate, and interact with streaming rollouts. In Sec. 4, we show experimental results that emphasize the difference of rollouts both for networks with recurrent and skip connections and for networks with only skip connections. To demonstrate the significance of the chosen rollouts w.r.t. the runtime for inference and achieved accuracy, we compare the two extreme rollouts: the most model-parallel, i.e., the streaming rollout (R ≡ 1, results in red in Fig. 3), and the most sequential rollout (R(e) = 0 for the maximal number of edges, results in blue in Fig. 3). For all experiments and rollout patterns under consideration, we conduct inference on shallow rollouts (W = 1) and initialize the zero-th frame of the next rollout window with the last (i.e., 1st) frame of the preceding rollout window (see discussion in Sec. 5). Datasets: Rollout patterns are evaluated on three datasets: MNIST [50], CIFAR10 [51], and the German Traffic Sign Recognition Benchmark (GTSRB) [52]. Results: Rollouts are compared on the basis of their test accuracies over the duration (measured in update steps) needed to achieve these accuracies (Fig. 3a-c, e, and g). (An illustrative toy sketch contrasting the two extreme rollouts is given below the table.)
Researcher Affiliation | Industry | Volker Fischer, Bosch Center for Artificial Intelligence, Renningen, Germany, volker.fischer@de.bosch.com; Jan Köhler, Bosch Center for Artificial Intelligence, Renningen, Germany, jan.koehler@de.bosch.com; Thomas Pfeil, Bosch Center for Artificial Intelligence, Renningen, Germany, thomas.pfeil@de.bosch.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Finally, we provide an open-source toolbox to design, train, evaluate, and interact with streaming rollouts. We provide an open-source toolbox specifically designed to study streaming rollouts of deep neural networks. Both are available as open-source code (footnote 3: https://github.com/boschresearch/statestream).
Open Datasets | Yes | Datasets: Rollout patterns are evaluated on three datasets: MNIST [50], CIFAR10 [51], and the German Traffic Sign Recognition Benchmark (GTSRB) [52]. (A dataset-loading sketch is given below the table.)
Dataset Splits | No | The paper mentions evaluating on datasets and test accuracies, but it does not specify explicit training/validation/test dataset splits with percentages, sample counts, or citations to predefined splits.
Hardware Specification | No | The paper discusses hardware generally and mentions potential future hardware like the "True North chip [56, 57]" in the context of massively parallel execution. However, it does not specify the particular hardware (e.g., GPU/CPU models, memory details) used to run the experiments reported in the paper.
Software Dependencies | No | For the experiments presented here, we use the Keras toolbox to compare different rollout patterns. Additionally, we implemented an experimental toolbox (TensorFlow and Theano backends) to study (define, train, evaluate, and visualize) networks using the streaming rollout pattern (see Sec. A3). While software names are mentioned, specific version numbers for Keras, TensorFlow, or Theano are not provided. (A version-logging sketch is given below the table.)
Experiment Setup | Yes | Details about data, preprocessing, network architectures, and the training process are given in Sec. A2. All experiments were implemented in Keras [61] with either TensorFlow [60] or Theano [59] backend. Networks were trained with RMSprop [58] with a learning rate of 10⁻⁴. To train the SR and S networks (Fig. A2), we used a batch size of 128 and trained for 100 epochs, with 10 epochs of early stopping on the validation set accuracy. For the DSR networks (Fig. A3), we used a batch size of 256 and 200 epochs, with 20 epochs of early stopping. (A training-configuration sketch is given below the table.)
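
To make the contrast between the two extreme rollouts concrete, the following NumPy sketch rolls out a toy three-layer network with one skip connection under both patterns. It is not taken from the paper or its toolbox; the toy architecture, weights, and function names (step_streaming, step_sequential) are illustrative assumptions. Under the streaming rollout every node reads only the previous frame, so all node updates within a frame are independent and could run in parallel; under the most sequential rollout, nodes within a frame depend on each other and must be updated in order.

```python
import numpy as np

rng = np.random.default_rng(0)
W_in_h1 = rng.standard_normal((4, 8))    # input   -> hidden1
W_h1_h2 = rng.standard_normal((8, 8))    # hidden1 -> hidden2
W_in_h2 = rng.standard_normal((4, 8))    # skip:    input -> hidden2
W_h2_out = rng.standard_normal((8, 2))   # hidden2 -> output

def relu(x):
    return np.maximum(x, 0.0)

def step_streaming(state, x):
    # Streaming rollout (R(e) = 1 for every edge): each node reads only the
    # previous frame, so the three updates below are mutually independent
    # and could be executed simultaneously on separate devices.
    h1, h2, _ = state
    return (relu(x @ W_in_h1),
            relu(h1 @ W_h1_h2 + x @ W_in_h2),   # h1 from the previous frame
            h2 @ W_h2_out)                      # h2 from the previous frame

def step_sequential(state, x):
    # Most sequential rollout (R(e) = 0 for the maximal number of edges):
    # nodes are updated in topological order within a single frame, so each
    # update has to wait for the one before it.
    h1 = relu(x @ W_in_h1)
    h2 = relu(h1 @ W_h1_h2 + x @ W_in_h2)       # h1 from the *same* frame
    return h1, h2, h2 @ W_h2_out

x = rng.standard_normal(4)
state = (np.zeros(8), np.zeros(8), np.zeros(2))
for frame in range(1, 4):
    state = step_streaming(state, x)
    print("streaming, frame %d: output %s" % (frame, state[2]))
print("sequential, one frame: output %s" % (step_sequential(state, x)[2],))
```

In the streaming version the output only carries information from the input after as many frames as the longest path through the network, but every node can be updated at every frame, which is what enables fully model-parallel execution and the earlier, more frequent responses discussed above.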
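
For a reproduction, two of the three datasets can be obtained through Keras' built-in loaders; GTSRB is not bundled with Keras and has to be downloaded separately from the benchmark website. This is a generic sketch, not the paper's preprocessing pipeline (which is described in Sec. A2 of the paper).

```python
from keras.datasets import mnist, cifar10

# MNIST: 60,000 training / 10,000 test grayscale images, 28x28
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# CIFAR10: 50,000 training / 10,000 test RGB images, 32x32
(c_train, cy_train), (c_test, cy_test) = cifar10.load_data()
print(x_train.shape, c_train.shape)
```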
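
Because library versions are not pinned in the paper, a reproduction should at least log the environment it actually runs in. A minimal sketch, assuming a classic multi-backend Keras installation:

```python
import keras
import keras.backend as K

# Record the library version and the active backend ('tensorflow' or 'theano')
# alongside the experiment results.
print("Keras version:", keras.__version__)
print("Keras backend:", K.backend())
```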
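
The reported hyperparameters for the SR and S networks translate roughly into the following Keras training call. This is a hedged sketch: build_model and the data variables are placeholders, the loss and the monitored metric name ('val_acc' in Keras 2.x, 'val_accuracy' in later versions) are assumptions, and the actual architectures are specified in Sec. A2 of the paper.

```python
from keras.optimizers import RMSprop
from keras.callbacks import EarlyStopping

model = build_model()  # placeholder for the rollout-specific architecture (Sec. A2)
model.compile(optimizer=RMSprop(lr=1e-4),           # learning rate 10^-4, as reported
              loss="categorical_crossentropy",      # assumed; all tasks are classification
              metrics=["accuracy"])

model.fit(x_train, y_train,
          batch_size=128,                           # 256 for the DSR networks
          epochs=100,                               # 200 for the DSR networks
          validation_data=(x_val, y_val),
          callbacks=[EarlyStopping(monitor="val_acc",   # 'val_accuracy' in newer Keras
                                   patience=10)])       # 20 for the DSR networks
```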