On-Device Training Under 256KB Memory

Authors: Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. ... Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB SRAM), using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy. ... 3 Experiments. We used three popular tiny ML models in our experiments: MobileNet V2... Proxyless NAS... MCUNet... We perform the training and memory/latency measurement on a microcontroller STM32F746 (320KB SRAM, 1MB Flash) using a single batch size."
Researcher Affiliation | Collaboration | Ji Lin¹, Ligeng Zhu¹, Wei-Ming Chen¹, Wei-Chen Wang¹, Chuang Gan², Song Han¹; ¹MIT, ²MIT-IBM Watson AI Lab
Pseudocode | No | The paper describes its algorithms and system design with figures and textual explanations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper points to a video demo ("A video demo can be found here," referring to https://tinyml.mit.edu/on-device-training), but there is no explicit statement or link confirming the release of the source code for the methodology described in the paper.
Open Datasets | Yes | "We measure the transfer learning accuracy on multiple downstream datasets and report the average accuracy [37]. We follow [12] to use a set of vision datasets including Cars [39], CIFAR10 [40], CIFAR-100 [40], CUB [67], Flowers [54], Food [9], and Pets [55]. We also include VWW dataset [20], a widely used benchmark for tiny ML applications."
Dataset Splits | Yes | "We fine-tuned the models on all these datasets for 50 epochs following [12]. We train on VWW for 10 epochs following [47]. ... Figure 8. Training and validation loss curves w/ and w/o QAS."
Hardware Specification | Yes | "We perform the training and memory/latency measurement on a microcontroller STM32F746 (320KB SRAM, 1MB Flash) using a single batch size. ... We deployed our training system to a Cortex M7 microcontroller STM32F746 to demonstrate the feasibility"
Software Dependencies | No | The paper mentions software such as PyTorch [56], TensorFlow [4], and TF-Lite Micro kernels for comparison, but does not provide specific version numbers for any software dependencies used in its own experimental setup.
Experiment Setup | Yes | "We used three popular tiny ML models in our experiments: MobileNet V2 [60] (width multiplier 0.35...), Proxyless NAS [13] (width multiplier 0.3...), MCUNet [47] (...). We pre-trained the models on ImageNet [22] and perform post-training quantization [34]. The quantized models are fine-tuned on downstream datasets... We perform the training and memory/latency measurement... using a single batch size. ... We fine-tuned the models on all these datasets for 50 epochs following [12]. We train on VWW for 10 epochs following [47]. We used resolution 128 for all datasets and models for a fair comparison. Please refer to the appendix (Section C) for detailed training hyper-parameters."
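
To make the 256KB constraint quoted in the Research Type row concrete, the arithmetic below estimates the memory of a single activation buffer at the paper's training resolution. The 32-channel, 64x64 feature-map shape is an illustrative assumption, not taken from the paper; the point is only that one full-precision buffer can already exceed the SRAM budget, while the same buffer in int8 fits.

    # Back-of-the-envelope activation-memory estimate (plain Python 3).
    # The feature-map shape below is an illustrative assumption.

    def activation_bytes(channels, height, width, bytes_per_element):
        """Bytes needed to hold one activation tensor of shape (channels, height, width)."""
        return channels * height * width * bytes_per_element

    fp32_kb = activation_bytes(32, 64, 64, 4) / 1024  # full-precision buffer
    int8_kb = activation_bytes(32, 64, 64, 1) / 1024  # int8 buffer

    print(f"fp32 activation buffer: {fp32_kb:.0f} KB")  # 512 KB
    print(f"int8 activation buffer: {int8_kb:.0f} KB")  # 128 KB
    print("STM32F746 SRAM: 320 KB; the paper targets a 256 KB training budget")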
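
The downstream datasets listed in the Open Datasets row are all publicly available. The paper does not state which data-loading library it used, so the sketch below is only an assumption using torchvision (CUB-200 and VWW are not bundled with torchvision and need separate download steps); the 128x128 input resolution follows the Experiment Setup quote.

    # Hedged sketch: fetching several of the cited downstream datasets with
    # torchvision. The choice of torchvision is an assumption, not the paper's.
    import torchvision
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize((128, 128)),  # resolution 128 for all datasets and models
        transforms.ToTensor(),
    ])

    cifar10 = torchvision.datasets.CIFAR10(
        root="data", train=True, download=True, transform=preprocess)
    cifar100 = torchvision.datasets.CIFAR100(
        root="data", train=True, download=True, transform=preprocess)
    flowers = torchvision.datasets.Flowers102(
        root="data", split="train", download=True, transform=preprocess)
    pets = torchvision.datasets.OxfordIIITPet(
        root="data", split="trainval", download=True, transform=preprocess)
    food = torchvision.datasets.Food101(
        root="data", split="train", download=True, transform=preprocess)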
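
The Dataset Splits and Experiment Setup rows describe the fine-tuning protocol: single-sample batches, 50 epochs on the vision datasets (10 on VWW), resolution 128, starting from ImageNet-pretrained, post-training-quantized models. The sketch below mirrors that protocol on a workstation as a reference point only; it is not the paper's on-device training system, which runs quantized training with sparse updates on the microcontroller itself, and the backbone, learning rate, and synthetic data are assumptions.

    # Cloud-side reference loop mirroring the quoted protocol; NOT the on-device
    # training engine described in the paper.
    import torch
    import torchvision
    from torchvision import transforms

    # FakeData stands in for a downstream dataset so the sketch runs offline.
    data = torchvision.datasets.FakeData(
        size=64, image_size=(3, 128, 128), num_classes=10,
        transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=1, shuffle=True)  # single batch size, as on-device

    # The paper uses MobileNetV2-0.35, ProxylessNAS-0.3, and MCUNet; torchvision
    # only ships pretrained weights for MobileNetV2 at width 1.0, used here
    # purely for illustration.
    model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")
    model.classifier[1] = torch.nn.Linear(model.last_channel, 10)  # new head for 10 classes

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # lr/momentum are assumptions
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    for epoch in range(50):  # 50 epochs on the vision datasets (the paper uses 10 for VWW)
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()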