Deep Tensor Convolution on Multicores

Authors: David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, Nir Shavit

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We benchmark 2D ConvNet performance against two popular frameworks: TensorFlow, using the newer Eigen 3.3 library (with AVX support); and Caffe, compiled to use Intel's optimized MKL library. We consider the propagation time of a 224×224 ImageNet image through three convolution layers to capture any necessary inter-layer reshuffling."
Researcher Affiliation | Academia | "Massachusetts Institute of Technology. Correspondence to: David Budden <budden@csail.mit.edu>."
Pseudocode | Yes | "Algorithm 1 Fast Vector Convolution"
Open Source Code | No | The paper does not state that source code for its own methodology is available. It cites third-party open-source projects such as "Wincnn. https://github.com/andravin/wincnn, 2016." and "Nnpack. https://github.com/Maratyszcza/NNPACK, 2016.", but not its own code.
Open Datasets | Yes | "We benchmark 2D ConvNet performance against two popular frameworks: TensorFlow, using the newer Eigen 3.3 library (with AVX support); and Caffe, compiled to use Intel's optimized MKL library. We consider the propagation time of a 224×224 ImageNet image through three convolution layers to capture any necessary inter-layer reshuffling."
Dataset Splits | No | The paper benchmarks the propagation of a 224×224 ImageNet image through convolution layers, but it does not specify training, validation, or test splits; its experiments measure operation throughput rather than model accuracy.
Hardware Specification | Yes | "We benchmarked the performance of our fast convolution algorithm on a 1.44 TFLOP/s Xeon E7-8890 CPU and observe that it executes at 70% maximum utilization. This includes all steps from input to output, including all necessary data reshuffling. As a point of comparison, Intel's own MKL convolutional primitive runs at just 20% (excluding reshuffling) on the same processor."
Software Dependencies | Yes | "TensorFlow, using the newer Eigen 3.3 library (with AVX support); and Caffe, compiled to use Intel's optimized MKL library." ... "We adopt the Cilk Plus work-stealing scheduler supported by GCC 4.8 (Blumofe et al., 1996; Robison, 2013)."
Experiment Setup | Yes | "We consider the propagation time of a 224×224 ImageNet image through three convolution layers to capture any necessary inter-layer reshuffling. We choose this simple architecture over a named network because we are not interested in comparing execution times of pooling, fully-connected or other layers. We also select an obscure kernel size (4×4) for which there have been no Winograd-style fast algorithms published, in order to demonstrate the generality of our implementation to arbitrary kernels. Each layer contains a modest 32 channels and 32 kernels for spreading the cost associated with applying transform matrices. Results presented are the fastest across batch sizes of 1, 20 and 200."
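
The Pseudocode row above refers to the paper's Algorithm 1 (Fast Vector Convolution), which generalizes Winograd-style fast convolution to arbitrary kernel sizes. As a point of reference only, the sketch below implements the classical Winograd F(2,3) instance to show the transform, elementwise-multiply, inverse-transform structure that such algorithms share; the fixed 3-tap kernel, the example data and the transform constants are illustrative assumptions, not the paper's Algorithm 1.

```cpp
// Illustrative Winograd F(2,3) fast convolution: two outputs of a 1D
// correlation with a 3-tap kernel using 4 multiplications instead of 6.
// Transform -> elementwise product -> inverse transform, as in fast
// vector convolution, but NOT the paper's Algorithm 1.
#include <array>
#include <cstdio>

int main() {
    // Input tile d (4 samples) and kernel g (3 taps): arbitrary example values.
    std::array<double, 4> d = {1.0, 2.0, 3.0, 4.0};
    std::array<double, 3> g = {0.5, -1.0, 2.0};

    // Input transform V = B^T d.
    std::array<double, 4> V = {d[0] - d[2], d[1] + d[2], -d[1] + d[2], d[1] - d[3]};
    // Kernel transform U = G g.
    std::array<double, 4> U = {g[0],
                               0.5 * (g[0] + g[1] + g[2]),
                               0.5 * (g[0] - g[1] + g[2]),
                               g[2]};
    // Elementwise product in the transform domain.
    std::array<double, 4> M;
    for (int i = 0; i < 4; ++i) M[i] = U[i] * V[i];
    // Output transform Y = A^T M yields two correlation outputs.
    double y0 = M[0] + M[1] + M[2];
    double y1 = M[1] - M[2] - M[3];

    // Direct correlation for comparison.
    double r0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2];
    double r1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2];
    std::printf("fast:   %f %f\n", y0, y1);
    std::printf("direct: %f %f\n", r0, r1);
}
```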
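
The Software Dependencies and Experiment Setup rows together describe the measured workload: three 4×4 convolution layers, each with 32 channels and 32 kernels, applied to a 224×224 input, with multicore parallelism handled by the Cilk Plus work-stealing scheduler. The sketch below is a hedged reconstruction of that harness, not the paper's implementation: it uses naive direct convolution as a stand-in for fast vector convolution, assumes valid (unpadded) convolution and a 32-channel input, and must be built with a GCC release that still ships Cilk Plus (g++ -O3 -fcilkplus).

```cpp
// Hedged reconstruction of the quoted benchmark: a 224x224 input propagated
// through three convolution layers (32 channels, 32 kernels, 4x4 kernels each),
// with the per-output-channel loop parallelized by the Cilk Plus scheduler.
#include <chrono>
#include <cstdio>
#include <vector>
#include <cilk/cilk.h>

// Valid (no padding) direct convolution: in[C][H][W] * w[K][C][4][4] -> out[K][H-3][W-3].
static void conv4x4(const std::vector<float>& in, const std::vector<float>& w,
                    std::vector<float>& out, int C, int K, int H, int W) {
    const int OH = H - 3, OW = W - 3;
    cilk_for (int k = 0; k < K; ++k) {  // independent output channels: work-stealing loop
        for (int y = 0; y < OH; ++y)
            for (int x = 0; x < OW; ++x) {
                float acc = 0.f;
                for (int c = 0; c < C; ++c)
                    for (int r = 0; r < 4; ++r)
                        for (int s = 0; s < 4; ++s)
                            acc += in[(c * H + (y + r)) * W + (x + s)] *
                                   w[((k * C + c) * 4 + r) * 4 + s];
                out[(k * OH + y) * OW + x] = acc;
            }
    }
}

int main() {
    const int C = 32, K = 32, Ksz = 4;
    int H = 224, W = 224;
    // Assumption: a 32-channel 224x224 input, since the quoted setup does not say
    // how the 3-channel RGB image is mapped to the 32-channel first layer.
    std::vector<float> x(C * H * W, 1.0f);
    std::vector<float> w(K * C * Ksz * Ksz, 0.01f);

    auto t0 = std::chrono::steady_clock::now();
    for (int layer = 0; layer < 3; ++layer) {
        const int OH = H - 3, OW = W - 3;
        std::vector<float> y(K * OH * OW);
        conv4x4(x, w, y, C, K, H, W);
        x.swap(y);  // output of one layer feeds the next
        H = OH; W = OW;
    }
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("3-layer forward pass: %.1f ms\n", ms);
}
```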
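
The Hardware Specification row quotes 70% utilization of a 1.44 TFLOP/s Xeon E7-8890, against roughly 20% for Intel's MKL primitive. The back-of-the-envelope arithmetic below converts those utilization figures into sustained throughput and sizes the direct-convolution cost of the quoted three-layer benchmark; the unpadded-convolution assumption is ours, and the FLOP count reflects direct convolution rather than the reduced arithmetic of the paper's fast algorithm.

```cpp
// Back-of-the-envelope throughput arithmetic for the quoted utilization figures.
#include <cstdio>

int main() {
    const double peak = 1.44e12;  // Xeon E7-8890 peak, FLOP/s (from the paper)
    std::printf("70%% utilization ~ %.2f TFLOP/s sustained\n", 0.70 * peak / 1e12);
    std::printf("20%% utilization ~ %.2f TFLOP/s sustained (MKL, reshuffling excluded)\n",
                0.20 * peak / 1e12);

    // Direct-convolution cost of the three 32->32 channel, 4x4-kernel layers on a
    // 224x224 input, assuming valid convolution (2 FLOPs per multiply-accumulate).
    double total = 0.0;
    int H = 224;
    for (int layer = 0; layer < 3; ++layer) {
        const int OH = H - 3;
        total += 2.0 * 32 * 32 * 4 * 4 * double(OH) * double(OH);
        H = OH;
    }
    std::printf("direct-conv work for three layers: %.2f GFLOP\n", total / 1e9);
}
```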