MEC: Memory-efficient Convolution for Deep Neural Network

Authors: Minsik Cho, Daniel Brand

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that MEC reduces memory consumption significantly with good speedup on both mobile and server platforms, compared with other indirect convolution algorithms.
Researcher Affiliation | Industry | IBM T. J. Watson Research Center, NY, USA.
Pseudocode | Yes | Algorithm 1 (O = Vanilla MEC(I, K, s)) and Algorithm 2 (O = MEC(I, K, s)) are provided. (A sketch of the lowering scheme follows the table.)
Open Source Code | No | The paper mentions implementing its algorithm with existing libraries and using other open-source convolutions for comparison, but it does not state that the code for MEC itself is open source or provide a link to it.
Open Datasets | Yes | For thorough comparison, we built a comprehensive benchmark set consisting of 12 unique convolution layers, cv1-cv12, from various public DNNs (He et al., 2015; Krizhevsky et al., 2012; Sermanet et al., 2013; Simonyan & Zisserman, 2014; Szegedy et al., 2014) as in Table 2.
Dataset Splits | No | The paper mentions 'mini-batch size' and refers to training, but it does not provide explicit details about train/validation/test dataset splits, percentages, or methods for partitioning the data.
Hardware Specification | Yes | Mobile: Android phone with ARM7 (MSM8960) for user-side inference and training (mini-batch size = 1). Server: Linux server with Intel CPU (E5-2680) and Nvidia GPU (P100) for inference and training (mini-batch size = 32).
Software Dependencies | No | We implemented MEC for CPU/GPU in C++ with multithreaded OpenBLAS, OpenMP, and cuBLAS, using single-precision (32-bit) floats. We also implemented a fully parallelized im2col-based convolution on CPU/GPU (Jia, 2014) with the same libraries. We downloaded an open-source FFT-based convolution (cuFFT; Theano-FFT) for GPU. We took an open-source Winograd-based convolution (Falcon, 2016) and optimized it to reduce memory overhead for CPU, and further modified/optimized it for GPU following (Lavin, 2015; Park et al., 2016a). (Note: specific version numbers for these libraries/tools are not provided. An im2col lowering sketch follows the table for contrast.)
Experiment Setup | Yes | The runtime in our experiments is measured as wall-clock time by a standard C++ library, running each algorithm 10 times and reporting the average. Mobile: Android phone with ARM7 (MSM8960) for user-side inference and training (mini-batch size = 1). Server: Linux server with Intel CPU (E5-2680) and Nvidia GPU (P100) for inference and training (mini-batch size = 32). T is a platform-dependent parameter (e.g., on CPU vs. GPU, or on GPU compute capability), and we found T around 100 to be a good threshold for the latest GPUs. (A timing-harness sketch follows the table.)
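
The Pseudocode row above cites Algorithm 1, Vanilla MEC. As a reading aid, here is a minimal single-channel C++ sketch of that lowering scheme, reconstructed from the paper's description; the function name, the row-major layout, and the naive inner loops (which stand in for the BLAS sgemm calls the authors use) are our assumptions, not the authors' code.

```cpp
#include <vector>
#include <cstddef>

// Hedged sketch of Vanilla MEC for one channel and one kernel.
// I: ih x iw input, K: kh x kw kernel (both row-major), s: stride.
// Returns the oh x ow output.
std::vector<float> vanilla_mec(const std::vector<float>& I, int ih, int iw,
                               const std::vector<float>& K, int kh, int kw,
                               int s) {
    const int oh = (ih - kh) / s + 1;
    const int ow = (iw - kw) / s + 1;

    // Lowering: copy ow overlapping vertical strips of width kw into L.
    // L is ow x (ih * kw), smaller than im2col's (oh * ow) x (kh * kw),
    // because horizontally overlapping pixels are stored only once.
    std::vector<float> L(static_cast<std::size_t>(ow) * ih * kw);
    for (int w = 0; w < ow; ++w)
        for (int r = 0; r < ih; ++r)
            for (int c = 0; c < kw; ++c)
                L[(static_cast<std::size_t>(w) * ih + r) * kw + c] =
                    I[static_cast<std::size_t>(r) * iw + w * s + c];

    // oh small matrix products: output row h reads a contiguous kh*kw
    // slice of every row of L, shifted down by h*s strip rows (h*s*kw floats).
    std::vector<float> O(static_cast<std::size_t>(oh) * ow, 0.0f);
    for (int h = 0; h < oh; ++h)
        for (int w = 0; w < ow; ++w) {
            const float* patch = &L[(static_cast<std::size_t>(w) * ih + h * s) * kw];
            float acc = 0.0f;
            for (int t = 0; t < kh * kw; ++t)
                acc += patch[t] * K[t];  // stands in for one sgemm per output row
            O[static_cast<std::size_t>(h) * ow + w] = acc;
        }
    return O;
}
```

The memory saving is visible in the shapes alone: the lowered matrix holds ow * ih * kw values instead of the oh * ow * kh * kw values an im2col lowering would allocate.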
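
For contrast, the im2col-based baseline the authors reimplemented (Jia, 2014) copies every receptive field in full. Below is a minimal single-channel lowering, again a sketch rather than the paper's fully parallelized implementation:

```cpp
#include <vector>
#include <cstddef>

// im2col lowering for one channel: every kh x kw patch is copied out in
// full, so the lowered matrix is (oh * ow) x (kh * kw) and overlapping
// pixels are duplicated. One large sgemm with the flattened kernels
// then finishes the convolution.
std::vector<float> im2col(const std::vector<float>& I, int ih, int iw,
                          int kh, int kw, int s) {
    const int oh = (ih - kh) / s + 1;
    const int ow = (iw - kw) / s + 1;
    std::vector<float> col(static_cast<std::size_t>(oh) * ow * kh * kw);
    std::size_t idx = 0;
    for (int h = 0; h < oh; ++h)
        for (int w = 0; w < ow; ++w)
            for (int r = 0; r < kh; ++r)
                for (int c = 0; c < kw; ++c)
                    col[idx++] =
                        I[static_cast<std::size_t>(h * s + r) * iw + w * s + c];
    return col;
}
```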
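
The Experiment Setup row says runtime was measured as wall-clock time with a standard C++ library, averaging 10 runs. A plausible harness, assuming std::chrono (the paper does not name the exact facility):

```cpp
#include <chrono>

// Hedged sketch of the measurement loop described above: run the given
// algorithm `reps` times and report the mean wall-clock time in ms.
template <typename F>
double mean_runtime_ms(F&& run, int reps = 10) {
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int i = 0; i < reps; ++i) {
        auto t0 = clock::now();
        run();  // e.g. one convolution layer from the cv1-cv12 benchmark
        auto t1 = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / reps;
}

// Hypothetical usage with the Vanilla MEC sketch above, on an
// AlexNet-like first layer (227x227 input, 11x11 kernel, stride 4):
// double ms = mean_runtime_ms([&] { vanilla_mec(I, 227, 227, K, 11, 11, 4); });
```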