Deep Tensor Convolution on Multicores
Authors: David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, Nir Shavit
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark 2D ConvNet performance against two popular frameworks: TensorFlow, using the newer Eigen 3.3 library (with AVX support); and Caffe, compiled to use Intel's optimized MKL library. We consider the propagation time of a 224×224 ImageNet image through three convolution layers to capture any necessary inter-layer reshuffling. |
| Researcher Affiliation | Academia | Massachusetts Institute of Technology. Correspondence to: David Budden <budden@csail.mit.edu>. |
| Pseudocode | Yes | Algorithm 1 Fast Vector Convolution (a hedged Winograd-style sketch follows the table) |
| Open Source Code | No | The paper does not state that source code for its own methodology is available. It cites third-party open-source projects such as "Wincnn. https://github.com/andravin/wincnn, 2016." and "Nnpack. https://github.com/Maratyszcza/NNPACK, 2016.", but not its own. |
| Open Datasets | Yes | We benchmark 2D ConvNet performance against two popular frameworks: TensorFlow, using the newer Eigen 3.3 library (with AVX support); and Caffe, compiled to use Intel's optimized MKL library. We consider the propagation time of a 224×224 ImageNet image through three convolution layers to capture any necessary inter-layer reshuffling. |
| Dataset Splits | No | The paper uses a 224×224 ImageNet image to benchmark convolution throughput, but it specifies no training, validation, or test splits; its experiments measure operation performance rather than model accuracy. |
| Hardware Specification | Yes | We benchmarked the performance of our fast convolution algorithm on a 1.44 TFLOP/s Xeon E7-8890 CPU and observe that it executes at 70% maximum utilization. This includes all steps from input to output, including all necessary data reshuffling. As a point of comparison, Intel's own MKL convolutional primitive runs at just 20% (excluding reshuffling) on the same processor. (The implied throughput is worked out below the table.) |
| Software Dependencies | Yes | TensorFlow, using the newer Eigen 3.3 library (with AVX support); and Caffe, compiled to use Intel's optimized MKL library. We adopt the Cilk Plus work-stealing scheduler supported by GCC 4.8 (Blumofe et al., 1996; Robison, 2013). |
| Experiment Setup | Yes | We consider the propagation time of a 224×224 ImageNet image through three convolution layers to capture any necessary inter-layer reshuffling. We choose this simple architecture over a named network because we are not interested in comparing execution times of pooling, fully-connected or other layers. We also select an obscure kernel size (4×4) for which there have been no Winograd-style fast algorithms published, in order to demonstrate the generality of our implementation to arbitrary kernels. Each layer contains a modest 32 channels and 32 kernels for spreading the cost associated with applying transform matrices. Results presented are the fastest across batch sizes of 1, 20 and 200. (A hedged benchmark sketch appears at the end of this page.) |
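The paper's Algorithm 1 (Fast Vector Convolution) generalizes Winograd-style minimal filtering to arbitrary kernel sizes such as the 4×4 case benchmarked above. For readers unfamiliar with the family, here is a minimal NumPy sketch of the classic F(2,3) identity y = Aᵀ[(G g) ⊙ (Bᵀ d)] that such algorithms build on: two outputs of a 3-tap correlation from 4 multiplies instead of 6. The transform matrices are the standard ones from the fast-convolution literature; this illustrates the underlying trick, not the paper's own algorithm.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices: two outputs of a
# 3-tap correlation computed with 4 elementwise multiplies instead of 6.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Correlate a length-4 input tile d with a length-3 kernel g via
    the minimal-filtering identity y = A^T [(G g) * (B^T d)]."""
    return AT @ ((G @ g) * (BT @ d))

# Sanity check against the direct 3-tap correlation.
rng = np.random.default_rng(0)
d, g = rng.standard_normal(4), rng.standard_normal(3)
direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(winograd_f23(d, g), direct)
```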
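As a worked check on the Hardware Specification row: 70% utilization of the 1.44 TFLOP/s Xeon E7-8890 corresponds to roughly 0.70 × 1.44 ≈ 1.0 TFLOP/s sustained, while MKL's 20% corresponds to about 0.29 TFLOP/s, an approximately 3.5× throughput gap even before accounting for the reshuffling that the MKL figure excludes.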
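Finally, the Experiment Setup row can be made concrete with the following sketch. It is hypothetical: the paper times its own C++ implementation against TensorFlow and Caffe, whereas this uses PyTorch solely to pin down the layer shapes and the best-over-batch-sizes protocol. The 3-channel RGB first layer is our assumption; the paper states only that each layer has 32 channels and 32 kernels.

```python
import time
import torch

# Hypothetical re-creation of the timing protocol described above: three
# 4x4 conv layers applied to a 224x224 image, reporting the fastest
# per-image time across batch sizes 1, 20 and 200. The 3->32 first layer
# is our assumption (an RGB input); the paper's layers are 32-channel.
net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, kernel_size=4),   # 4x4 kernels, as in the paper
    torch.nn.Conv2d(32, 32, kernel_size=4),
    torch.nn.Conv2d(32, 32, kernel_size=4),
)

@torch.no_grad()
def per_image_time(batch, reps=10):
    """Average forward time per image at this batch size."""
    x = torch.randn(batch, 3, 224, 224)
    net(x)                                   # warm-up pass
    start = time.perf_counter()
    for _ in range(reps):
        net(x)
    return (time.perf_counter() - start) / (reps * batch)

# Report the fastest result across the three batch sizes.
times = {b: per_image_time(b) for b in (1, 20, 200)}
best = min(times, key=times.get)
print(f"fastest: batch={best}, {times[best] * 1e3:.2f} ms/image")
```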