MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory
Authors: Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our MCUFormer achieves 73.62% top-1 accuracy on ImageNet for image classification with 320KB memory on STM32F746 microcontroller. Code is available at https://github.com/liangyn22/MCUFormer. |
| Researcher Affiliation | Academia | ¹Shenzhen International Graduate School and ²Department of Automation, Tsinghua University |
| Pseudocode | No | We detail the pseudo algorithm of integer-only square root operation in the supplementary material. |
| Open Source Code | Yes | Code is available at https://github.com/liangyn22/MCUFormer. |
| Open Datasets | Yes | We conduct the experiments on ImageNet for image classification, which contains 1.2 million training images and 50k validation images from 1000 classes. All images are scaled and biased into the range [-1, 1] for normalization. For the training process, we resize the images with the shorter side as 256 and randomly crop a 240×240 region. |
| Dataset Splits | Yes | We conduct the experiments on ImageNet for image classification, which contains 1.2 million training images and 50k validation images from 1000 classes. |
| Hardware Specification | Yes | We deploy the vision transformers with our hardware-algorithm co-optimization framework on different microcontrollers with various resource constraints, including STM32F427 (Cortex-M4/256KB memory/1MB flash), STM32F746 (Cortex-M7/320KB memory/1MB flash) and STM32H743 (Cortex-M7/512KB memory/2MB flash). |
| Software Dependencies | No | The paper mentions software frameworks like TensorFlow Lite Micro, CMSIS-NN, CMix-NN, MicroTVM, and TinyEngine, but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | For the network architecture search of vision transformers, our search space consists of the low-rank decomposition ratio r and the token size c, selected from r ∈ [0.4 : 0.05 : 0.95] and c ∈ {16, 20, 24, 28, 32}. ... For operator library construction, we utilize int8 quantization for all tensors in the vision transformer during inference. The filter size of the decomposed patch embedding layer is set to 4×4 with multiple forward passes to reduce the peak memory, and we iterate the surrogate assignment from the fixed-point iterative method 4 times to calculate the square root in the layer normalization operators. (Both the search space and the square-root iteration are sketched in C below.) |
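The Open Datasets row describes scaling and biasing pixels into [-1, 1]. A one-line C helper shows the arithmetic; the function name and the float intermediate are illustrative assumptions (a deployed int8 pipeline would fold this into the quantization scale and zero point rather than compute floats on-device).

```c
#include <stdint.h>

/* Map an 8-bit pixel p in [0, 255] to the [-1, 1] range the paper
 * uses for normalization: p / 127.5 - 1. Illustrative sketch only;
 * on-device this would be folded into the int8 quantization params. */
static inline float normalize_pixel(uint8_t p)
{
    return (float)p / 127.5f - 1.0f;
}
```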
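The Experiment Setup row specifies a grid over the low-rank decomposition ratio r and the token size c. A minimal C sketch that enumerates this grid makes its size explicit (12 ratio steps × 5 token sizes = 60 candidates); the exhaustive traversal and the print format are illustrative assumptions, not the authors' actual search procedure.

```c
#include <stdio.h>

int main(void)
{
    /* Token sizes c and decomposition ratios r as reported in the
     * paper; exhaustive enumeration is for illustration only and is
     * not necessarily how the authors traverse the search space. */
    const int c_choices[] = {16, 20, 24, 28, 32};
    for (int i = 0; i <= 11; i++) {          /* r = 0.40, 0.45, ..., 0.95 */
        double r = 0.40 + 0.05 * (double)i;
        for (int j = 0; j < 5; j++)
            printf("candidate: r=%.2f c=%d\n", r, c_choices[j]);
    }
    return 0;                                /* 12 * 5 = 60 candidates */
}
```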
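The same row notes that square roots in the layer normalization operators are computed with an integer-only routine iterated 4 times, with the exact pseudocode deferred to the supplementary material (see the Pseudocode row). Below is a hedged C sketch of the standard Newton fixed-point iteration for integer square root; the function name, the bit-length initial guess, and the use of `__builtin_clz` (GCC/Clang, available on the Cortex-M toolchains named above) are assumptions, and only the 4-iteration count comes from the paper.

```c
#include <stdint.h>

/* Integer-only square root via Newton fixed-point iteration.
 * Sketch only: the authors' actual pseudocode is in their
 * supplementary material; the paper states only that the
 * iteration is applied 4 times in layer normalization. */
static uint32_t isqrt_newton4(uint32_t x)
{
    if (x == 0) return 0;
    /* Initial guess 2^ceil(b/2), where b is the bit length of x;
     * this lies within a factor of 2 of sqrt(x), so four Newton
     * updates already land within +/-1 of the exact root. */
    uint32_t g = 1u << ((32 - (uint32_t)__builtin_clz(x) + 1) / 2);
    for (int i = 0; i < 4; i++)
        g = (g + x / g) / 2;   /* Newton update, integer division */
    return g;
}
```

On Cortex-M4/M7 the `x / g` division maps to the hardware `UDIV` instruction, so the four iterations stay cheap relative to the surrounding int8 kernels.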