High Performance Zero-Memory Overhead Direct Convolutions

Authors: Jiyuan Zhang, Franz Franchetti, Tze Meng Low

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead, and yields performance that is between 10% and 400% better than existing high-performance implementations of convolution layers on conventional and embedded CPU architectures. We also show that a high-performance direct convolution exhibits better scaling, i.e., it suffers less performance drop as the number of threads increases. Section 5.1, Experimental Setup: We run our experiments on Intel Core i7-4770K, AMD FX(tm)-8350, and ARM Cortex-A57 architectures. The architecture details of those platforms are shown in Table 1. (A minimal direct-convolution loop-nest sketch illustrating the zero-memory-overhead claim follows this table.)
Researcher Affiliation | Academia | Jiyuan Zhang, Franz Franchetti, Tze Meng Low; Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA.
Pseudocode | Yes | Algorithm 1 (Naive Convolution Algorithm), Algorithm 2 (Reorder Convolution Algorithm), and Algorithm 3 (Parallelized Direct Convolution Algorithm). (A hedged sketch of a parallelized direct-convolution loop nest also follows the table.)
Open Source Code | No | The paper does not state that the code for their direct convolution implementation is publicly available, nor does it provide a link to it.
Open Datasets | Yes | All implementations were run against all convolution layers found in AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015), and VGG (Simonyan & Zisserman, 2014).
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits (e.g., percentages or sample counts) for the convolution layers benchmarked from AlexNet, GoogLeNet, and VGG.
Hardware Specification | Yes | Platform: We run our experiments on Intel Core i7-4770K, AMD FX(tm)-8350, and ARM Cortex-A57 architectures. The architecture details of those platforms are shown in Table 1. Table 1 (Details of specific architectures used): Intel i7-4770K: Haswell architecture, 3.5 GHz, 4 cores, Nvec = 8; AMD FX(tm)-8350: Piledriver, 4 GHz, 4 cores, Nvec = 8; ARM Cortex-A57: ARMv8, 1.1 GHz, 2 cores, Nvec = 4.
Software Dependencies | No | The paper mentions software such as the Intel Math Kernel Library (MKL), OpenBLAS, and NNPACK, but does not provide specific version numbers for these components, which are required for reproducibility.
Experiment Setup | No | The paper details algorithmic and architectural mapping strategies for direct convolution, but does not provide typical experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) for training a deep neural network.
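
To give concrete shape to the zero-memory-overhead claim quoted under Research Type: a naive direct convolution is a plain loop nest that reads the input and kernel and writes the output, allocating nothing, whereas im2col-based GEMM convolutions must first materialize a patch matrix roughly K*K times the size of the input. The following is a minimal sketch, not the paper's implementation; the function names, data layouts (input [C][H][W], kernel [M][C][K][K], output [M][Ho][Wo]), unit stride, and lack of padding are all illustrative assumptions.

```c
#include <stddef.h>

/* Naive direct convolution (stride 1, no padding), in the spirit of the
 * paper's Algorithm 1: it reads the input and kernel and writes the output,
 * allocating no scratch memory at all. Layouts are assumptions:
 * input [C][H][W], kernel [M][C][K][K], output [M][Ho][Wo],
 * with Ho = H - K + 1 and Wo = W - K + 1. */
static void conv_direct(const float *in, const float *ker, float *out,
                        size_t C, size_t H, size_t W, size_t M, size_t K)
{
    size_t Ho = H - K + 1, Wo = W - K + 1;
    for (size_t m = 0; m < M; m++)
        for (size_t ho = 0; ho < Ho; ho++)
            for (size_t wo = 0; wo < Wo; wo++) {
                float acc = 0.0f;
                for (size_t c = 0; c < C; c++)
                    for (size_t kh = 0; kh < K; kh++)
                        for (size_t kw = 0; kw < K; kw++)
                            acc += in[(c * H + ho + kh) * W + wo + kw]
                                 * ker[((m * C + c) * K + kh) * K + kw];
                out[(m * Ho + ho) * Wo + wo] = acc;
            }
}

/* By contrast, an im2col-based convolution first copies every K x K x C
 * input patch into a scratch matrix of C*K*K rows by Ho*Wo columns, i.e.
 * roughly K*K times the input size, before calling a GEMM. This helper
 * computes the number of extra floats that the direct method avoids. */
static size_t im2col_scratch_floats(size_t C, size_t H, size_t W, size_t K)
{
    size_t Ho = H - K + 1, Wo = W - K + 1;
    return (C * K * K) * (Ho * Wo);
}
```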
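
As a rough illustration of what the parallelization in Algorithm 3 amounts to (again a hedged sketch under the same assumed layouts, not the paper's blocked and vectorized loop ordering), the outer output-channel and output-row loops of the direct convolution above can be split across threads with OpenMP, since each iteration writes a disjoint slice of the output:

```c
#include <stddef.h>
#include <omp.h>

/* Sketch of a parallelized direct convolution: the (m, ho) iteration space
 * is distributed across threads. No two iterations write the same output
 * element, so no synchronization is needed. The paper's Algorithm 3
 * additionally orders and blocks these loops for SIMD registers and caches,
 * which this sketch omits. */
static void conv_direct_parallel(const float *in, const float *ker, float *out,
                                 size_t C, size_t H, size_t W,
                                 size_t M, size_t K)
{
    size_t Ho = H - K + 1, Wo = W - K + 1;
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t m = 0; m < M; m++)
        for (size_t ho = 0; ho < Ho; ho++)
            for (size_t wo = 0; wo < Wo; wo++) {
                float acc = 0.0f;
                for (size_t c = 0; c < C; c++)
                    for (size_t kh = 0; kh < K; kh++)
                        for (size_t kw = 0; kw < K; kw++)
                            acc += in[(c * H + ho + kh) * W + wo + kw]
                                 * ker[((m * C + c) * K + kh) * K + kw];
                out[(m * Ho + ho) * Wo + wo] = acc;
            }
}
```

Because the iterations are independent, this embarrassingly parallel outer structure is what lets a well-implemented direct convolution scale across cores; the paper's reported scaling advantage comes from combining it with a loop ordering tuned to the memory hierarchy.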