MEC: Memory-efficient Convolution for Deep Neural Network
Authors: Minsik Cho, Daniel Brand
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that MEC reduces memory consumption significantly with good speedup on both mobile and server platforms, compared with other indirect convolution algorithms. |
| Researcher Affiliation | Industry | IBM T. J. Watson Research Center, NY, USA. |
| Pseudocode | Yes | Algorithm 1 (O = VanillaMEC(I, K, s)) and Algorithm 2 (O = MEC(I, K, s)) are provided. |
| Open Source Code | No | The paper mentions implementing its algorithm with existing libraries and using other open-source convolutions for comparison, but it does not state that the code for MEC itself is open-source or provide a link to it. |
| Open Datasets | Yes | For thorough comparison, we built a comprehensive benchmark set consisting of 12 unique convolution layers, cv1-cv12 from various public DNNs (He et al., 2015; Krizhevsky et al., 2012; Sermanet et al., 2013; Simonyan & Zisserman, 2014; Szegedy et al., 2014) as in Table 2. |
| Dataset Splits | No | The paper mentions 'mini-batch size' and refers to training, but it does not provide explicit details about train/validation/test dataset splits, percentages, or methods for partitioning the data. |
| Hardware Specification | Yes | Mobile: Android phone with ARM7 (MSM8960) for user-side inference and training (mini-batch size=1). Server: Linux server with Intel CPU (E5-2680) and Nvidia GPU (P100) for inference and training (mini-batch size=32). |
| Software Dependencies | No | We implemented MEC for CPU/GPU in C++ with multithreaded OpenBLAS, OpenMP, and cuBLAS using single 32-bit precision. We also implemented a fully parallelized im2col-based convolution on CPU/GPU (Jia, 2014) with the same libraries. We downloaded an open-source FFT-based convolution (cuFFT; Theano-FFT) for GPU. We took an open-source Winograd-based convolution (Falcon, 2016) and optimized it to reduce memory-overhead for CPU, and further modified/optimized it for GPU following (Lavin, 2015; Park et al., 2016a). (Note: Specific version numbers for these libraries/tools are not provided.) |
| Experiment Setup | Yes | The runtime in our experiments is measured as wall-clock time by a standard C++ library, running each algorithm 10 times and reporting the average. Mobile: Android phone with ARM7 (MSM8960) for user-side inference and training (mini-batch size=1). Server: Linux server with Intel CPU (E5-2680) and Nvidia GPU (P100) for inference and training (mini-batch size=32). T is a platform-dependent parameter (e.g., on CPU vs. GPU, or on GPU-compute capability), and we found T around 100 to be a good threshold for latest GPUs. |
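The table notes that the paper provides pseudocode for Algorithm 1 (Vanilla MEC), whose key idea is a compact lowering of the input: instead of im2col's one row per output pixel, MEC stores one row per output *column* and reuses overlapping slices of that matrix in a series of small GEMMs. The following is a minimal NumPy sketch of that lowering, assuming stride 1, no padding, and a single channel; the function names and reference implementation are ours, not code from the paper.

```python
import numpy as np

def conv2d_direct(I, K):
    """Reference direct convolution (stride 1, no padding), for checking."""
    ih, iw = I.shape
    kh, kw = K.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    O = np.zeros((oh, ow))
    for h in range(oh):
        for w in range(ow):
            O[h, w] = np.sum(I[h:h + kh, w:w + kw] * K)
    return O

def conv2d_mec(I, K):
    """Sketch of the Vanilla MEC lowering (stride 1, single channel).

    The lowered matrix L has one row per output column; row w holds the
    kw-wide vertical strip I[:, w:w+kw] flattened row-major, so its size
    is ow * ih * kw instead of im2col's oh * ow * kh * kw.
    """
    ih, iw = I.shape
    kh, kw = K.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    L = np.zeros((ow, ih * kw))
    for w in range(ow):
        L[w] = I[:, w:w + kw].reshape(-1)
    # Each output row h is one small matrix-vector product over an
    # overlapping column slice of L (rows h..h+kh-1 of each strip).
    k = K.reshape(-1)
    O = np.zeros((oh, ow))
    for h in range(oh):
        O[h] = L[:, h * kw:(h + kh) * kw] @ k
    return O
```

For a 7x7 input and 3x3 kernel, the lowered matrix holds 5 * 7 * 3 = 105 values versus 5 * 5 * 3 * 3 = 225 for im2col, which illustrates the memory saving the paper reports; the trade-off is oh small GEMMs instead of one large one, which is where the platform-dependent threshold T in the experiment setup comes in.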