On-Device Training Under 256KB Memory
Authors: Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. ... Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB SRAM), using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy. ... (Section 3, Experiments) We used three popular tiny ML models in our experiments: MobileNetV2... ProxylessNAS... MCUNet... We perform the training and memory/latency measurement on a microcontroller STM32F746 (320KB SRAM, 1MB Flash) using a single batch size. |
| Researcher Affiliation | Collaboration | Ji Lin¹, Ligeng Zhu¹, Wei-Ming Chen¹, Wei-Chen Wang¹, Chuang Gan², Song Han¹ (¹MIT, ²MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper describes its algorithms and system design with figures and textual explanations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links only a video demo ("A video demo can be found here", pointing to https://tinyml.mit.edu/on-device-training); there is no explicit statement or link confirming the release of the source code for the methodology described in the paper. |
| Open Datasets | Yes | We measure the transfer learning accuracy on multiple downstream datasets and report the average accuracy [37]. We follow [12] to use a set of vision datasets including Cars [39], CIFAR-10 [40], CIFAR-100 [40], CUB [67], Flowers [54], Food [9], and Pets [55]. We also include VWW dataset [20], a widely used benchmark for tiny ML applications. |
| Dataset Splits | Yes | We fine-tuned the models on all these datasets for 50 epochs following [12]. We train on VWW for 10 epochs following [47]. ... Figure 8. Training and validation loss curves w/ and w/o QAS. |
| Hardware Specification | Yes | We perform the training and memory/latency measurement on a microcontroller STM32F746 (320KB SRAM, 1MB Flash) using a single batch size. ... We deployed our training system to a Cortex M7 microcontroller STM32F746 to demonstrate the feasibility |
| Software Dependencies | No | The paper mentions PyTorch [56] and TensorFlow [4] as memory-comparison baselines, as well as TF-Lite Micro kernels, but does not provide version numbers for any software dependencies used in its own experimental setup. |
| Experiment Setup | Yes | We used three popular tiny ML models in our experiments: MobileNetV2 [60] (width multiplier 0.35...), ProxylessNAS [13] (width multiplier 0.3...), MCUNet [47] (...). We pre-trained the models on ImageNet [22] and perform post-training quantization [34]. The quantized models are fine-tuned on downstream datasets... We perform the training and memory/latency measurement... using a single batch size. ... We fine-tuned the models on all these datasets for 50 epochs following [12]. We train on VWW for 10 epochs following [47]. We used resolution 128 for all datasets and models for a fair comparison. Please refer to the appendix (Section C) for detailed training hyper-parameters. |
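
The Experiment Setup row above pins down a concrete transfer-learning protocol: an ImageNet-pretrained backbone, downstream fine-tuning at resolution 128, batch size 1, and 50 epochs (10 for VWW). The sketch below mirrors that protocol on CIFAR-10 as a plain fp32 reference baseline; it is an illustration only, not the authors' released pipeline. torchvision ships no ImageNet-pretrained width-0.35 MobileNetV2, so the standard width-1.0 model stands in, and the paper's post-training int8 quantization [34] and quantization-aware scaling are omitted.

```python
# Hedged sketch: a server-side fp32 reference of the fine-tuning protocol quoted above
# (ImageNet-pretrained backbone, 128x128 inputs, batch size 1, 50 epochs on CIFAR-10).
# This is NOT the paper's on-device training system: int8 quantization and QAS are
# omitted, and width-1.0 MobileNetV2 stands in for the width-0.35 variant.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms


def build_model(num_classes: int) -> nn.Module:
    model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
    # Replace the classification head for the downstream task.
    model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    return model


def main():
    transform = transforms.Compose([
        transforms.Resize((128, 128)),  # the paper uses resolution 128 for all datasets
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
    loader = DataLoader(train_set, batch_size=1, shuffle=True)  # batch size 1, as on the MCU

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = build_model(num_classes=10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(50):  # 50 epochs, following the quoted setup
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: last-batch loss {loss.item():.4f}")


if __name__ == "__main__":
    main()
```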
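
The Hardware Specification row cites an STM32F746 with 320KB SRAM and 1MB Flash, while the abstract targets 256KB of training memory. The back-of-envelope below (an assumed calculation, not taken from the paper) counts int8 parameter bytes against the Flash budget and estimates peak per-layer int8 activation bytes at batch size 1 and resolution 128. Training needs additional memory beyond this inference-style estimate for saved activations and gradients, which is the constraint the paper's co-design addresses, so treat the numbers as a rough sanity check rather than the paper's measured footprint.

```python
# Hedged back-of-envelope: int8 weight bytes vs. the 1MB Flash budget, and a rough
# per-layer int8 activation estimate at batch size 1 / resolution 128 vs. 256KB.
# Rough estimates only; the paper reports measured footprints of its own runtime.
import torch
from torchvision import models

FLASH_BUDGET = 1 * 1024 * 1024   # 1 MB Flash on the STM32F746
SRAM_BUDGET = 256 * 1024         # the paper's 256 KB training-memory target

model = models.mobilenet_v2()    # width-1.0 stand-in; the paper uses width 0.35
model.eval()

# 1 byte per weight after int8 post-training quantization.
weight_bytes = sum(p.numel() for p in model.parameters())
print(f"int8 weights: {weight_bytes / 1024:.0f} KB (Flash budget {FLASH_BUDGET // 1024} KB)")

# Record each leaf layer's tensor sizes; peak inference activation memory is roughly
# the largest (input + output) pair among layers, assuming int8 activations.
peak = 0


def hook(module, inputs, output):
    global peak
    in_bytes = sum(t.numel() for t in inputs if torch.is_tensor(t))
    out_bytes = output.numel() if torch.is_tensor(output) else 0
    peak = max(peak, in_bytes + out_bytes)


handles = [m.register_forward_hook(hook)
           for m in model.modules() if len(list(m.children())) == 0]
with torch.no_grad():
    model(torch.zeros(1, 3, 128, 128))  # batch size 1, resolution 128
for h in handles:
    h.remove()

print(f"rough peak int8 activations: {peak / 1024:.0f} KB (SRAM budget {SRAM_BUDGET // 1024} KB)")
```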