Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling and Distilling Transformer Models for sEMG
Authors: Nick Mehlman, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Kelvin Niu, Alexander H Miller, Shagun Sodhani
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we demonstrate that vanilla transformer models can be effectively scaled up on sEMG data and yield improved cross-user performance up to 110M parameters, surpassing the model size regime investigated in other sEMG research (usually <10M parameters). We show that >100M-parameter models can be effectively distilled into models 50x smaller with minimal loss of performance (<1.5% absolute). This results in efficient and expressive models suitable for complex real-time sEMG tasks in real-world environments. Section 3 Experiments: We primarily focus on zero-shot performance on held-out users, and additionally report personalization performance... Table 2: Cross-user performance of transformer models trained on the emg2qwerty dataset... Figure 3: Scaling curve of transformers on the emg2qwerty dataset... |
| Researcher Affiliation | Collaboration | Nicholas Mehlman (Viterbi School of Engineering, University of Southern California); Jean-Christophe Gagnon-Audet (Meta FAIR); Michael Shvartsman (Meta FAIR); Kelvin Niu (Meta FAIR); Alexander H. Miller (Meta FAIR); Shagun Sodhani (Meta FAIR) |
| Pseudocode | No | The paper describes methods and experiments in narrative text. No explicit pseudocode blocks, algorithm figures, or sections labeled 'Pseudocode' or 'Algorithm' were found. |
| Open Source Code | Yes | The code used for training and distilling the models is available at https://github.com/facebookresearch/fairemg. We hope that it will make it easier for the scientific community to reproduce our results and extend this work. |
| Open Datasets | Yes | We use the emg2qwerty dataset (Sivakumar et al., 2024) in our experiments. The dataset consists of two-handed sEMG recordings from users typing on a computer keyboard. The data is labeled with the corresponding keystrokes, and the task is to map from sEMG sequences to character sequences. Figure 1 shows a representative example. In total, the dataset contains 346 hours of sEMG recordings across 108 unique users. Figure 1: The emg2qwerty task: participants type on a keyboard while sEMG activity is recorded from both hands. The goal is to map from sequences of sEMG signals to sequences of characters. Figure cropped from https://github.com/facebookresearch/emg2qwerty, licensed CC BY-NC-SA. |
| Dataset Splits | Yes | The dataset is split into 100 users for training and validation and 8 held-out users for testing. For each user, we hold out 2 validation sessions and 2 testing sessions, then use the rest for training. In the generic setting, we train on the 100-user training set, validate on the 100-user validation set, and evaluate on the 8-user testing set. In the personalization setting, for each of the 8 users, we train on their individual training set, then validate and test on their respective validation and testing sets. Sessions are windowed to form 4-second samples, padded with an additional 900 ms of past context and 100 ms of future context. |
| Hardware Specification | Yes | We train the transformer models on a single HPC node containing 8 32GB V100 GPUs, in a Distributed Data Parallel (DDP) training scheme. |
| Software Dependencies | Yes | Torch 2.3.1+cu121; Transformers 4.36.2 |
| Experiment Setup | Yes | We trained 20 different architectures, generated by permuting [2, 4, 6, 8, 10] layers with inner dimensions of [128, 256, 512, 1024]. The ratio of the transformer inner dimension to the transformer feed-forward dimension is fixed at four. While we report the performance of all models in the supplement, for ease of exposition in the main text we focus on three reference architectures: the Tiny architecture, consisting of 10 layers of inner dimension 128 (about 2.2M parameters); the Small architecture, consisting of 6 layers of inner dimension 256 (about 5.4M parameters, close to the 5.3M TDS-ConvNet baseline); and the Large architecture, consisting of 8 layers of inner dimension 1024 (about 109M parameters). We use the AdamW optimizer (Loshchilov, 2017) for training all models. In the figures we additionally include other models along the size-performance Pareto frontier (i.e., ones which perform better than any model of the same or lesser parameter count). For all experiments, we report standard deviation across multiple seeds (6 seeds for supervised training of transformer models, 3 seeds for personalization experiments, and 3 seeds for distillation experiments). Following Sivakumar et al. (2024), we use the connectionist temporal classification (CTC) loss (Graves et al., 2006) to train the transformer models on the emg2qwerty task. We use cosine learning-rate scheduling (Loshchilov & Hutter, 2017) with linear warmup for the first 5% of updates. We document all hyperparameters in Appendix B. We set β = 1.0 and experimented with α values in [0.1, 2], finding the best values in the [0.3, 0.5] range; we use α = 0.5 throughout our distillation experiments. The rest of the distillation hyperparameters used in our experiments are documented in Appendix B.3. |
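The windowing described in the Dataset Splits row (4-second samples padded with 900 ms of past and 100 ms of future context) can be sketched as follows. The 2 kHz sample rate and the non-overlapping stride are assumptions for illustration, not details stated in this report:

```python
# Sketch of emg2qwerty-style session windowing: each training sample is a
# 4 s window padded with 900 ms of past and 100 ms of future context.
# ASSUMPTIONS: 2 kHz sample rate, non-overlapping 4 s stride.
SAMPLE_RATE_HZ = 2000
WINDOW_S, PAST_S, FUTURE_S = 4.0, 0.9, 0.1

def window_session(session_len, sample_rate=SAMPLE_RATE_HZ):
    """Yield (start, end) sample-index spans, each covering past context +
    4 s window + future context, stepping one full window at a time."""
    win = int(WINDOW_S * sample_rate)
    past = int(PAST_S * sample_rate)
    future = int(FUTURE_S * sample_rate)
    for w_start in range(past, session_len - win - future + 1, win):
        yield (w_start - past, w_start + win + future)

# Under these assumptions, a 60 s session yields 14 padded spans of
# 10,000 samples (5 s) each.
spans = list(window_session(60 * SAMPLE_RATE_HZ))
```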
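The learning-rate schedule from the Experiment Setup row (cosine annealing with linear warmup over the first 5% of updates) can be sketched as a minimal step-to-rate function. Decaying to zero, per-step warmup, and the example `base_lr`/`total_steps` values are assumptions, not values taken from the paper:

```python
import math

# Sketch of cosine learning-rate scheduling with linear warmup over the
# first 5% of updates. ASSUMPTIONS: the rate decays to zero by the final
# update; base_lr and total_steps below are illustrative only.
def lr_at_step(step, total_steps, base_lr, warmup_frac=0.05):
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: base_lr -> 0 over the remaining updates.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` and `base_lr=1.0`, the rate peaks at the end of the 50-step warmup and falls to half the base rate at the cosine midpoint (step 525).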