Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SensorLM: Learning the Language of Wearable Sensors

Authors: Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A. Ali Heydari, Girish Narayanswamy, Max Xu, Ahmed Metwally, Jinhua Xu, Jake Garrison, Xuhai "Orson" Xu, Tim Althoff, Yun Liu, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Cecilia Mascolo, Xin Liu, Daniel McDuff, Yuzhe Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.
Researcher Affiliation	Collaboration	(1) Google Research; (2) Google DeepMind; (3) University of Cambridge; (4) University of California, Los Angeles
Pseudocode	No	The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor structured steps formatted like code or an algorithm.
Open Source Code	Yes	Code is available at https://github.com/Google-Health/consumer-health-research/tree/main/sensorlm.
Open Datasets	No	Under our IRB-approved protocol, we have obtained informed consent to release a de-identified version of the downstream dataset, including hypertension, anxiety, age, and BMI labels. We are unable to release the pretraining data due to privacy concerns, but the main paper and Appendix provide sufficient details for reproducibility. Access will be limited to researchers with verified academic affiliations.
Dataset Splits	Yes	Activity dataset. The Activity dataset comprises 22,289 person-days from 10,013 individuals. We randomly sampled 1,000 test examples for each activity for zero-shot activity recognition (AR) and few-shot adaptation. Metabolic dataset. The data comprises 241,532 examples (151,346 training and 90,186 testing) from 1,979 individuals. For linear probing, we use the full training set (statistics provided in Table 7). For few-shot learning, we randomly sample 5–50 training examples per class and repeat each experiment five times with different random seeds. Table 18 summarizes the class distribution for the Anxiety and Hypertension tasks.
Hardware Specification	Yes	All models are trained using the SensorLM pretraining objective (L_SensorLM) on Google v6 TPUs for 50k steps.
Software Dependencies	No	The paper mentions 'Adam optimizer' and implicitly uses libraries like 'scikit-learn' in footnotes, but does not provide specific version numbers for any software components.
Experiment Setup	Yes	All models are trained using the SensorLM pretraining objective (L_SensorLM) on Google v6 TPUs for 50k steps. We use a batch size of 1024 sensor-text pairs, with λ_con = λ_cap = 1 for main experiments. The temperature τ for the contrastive loss is set to 0.01. Optimization is performed using Adam optimizer with β_1 = 0.9 and β_2 = 0.95. A cosine warm-up schedule is applied for the first 10% of training steps, followed by linear decay of the learning rate to zero.
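
The learning-rate schedule quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the function name `lr_multiplier` is hypothetical, the excerpt does not state the peak learning rate (so the function returns a multiplier in [0, 1] to be scaled by whatever peak is used), and "cosine warm-up" is assumed here to mean a half-cosine ramp from zero to the peak.

```python
import math


def lr_multiplier(step, total_steps=50_000, warmup_frac=0.10):
    """Return a learning-rate multiplier in [0, 1] for a given training step.

    Assumptions (not confirmed by the excerpt): the warm-up is a half-cosine
    ramp from 0 to 1, and the peak learning rate is applied by the caller.
    The 50k steps and 10% warm-up fraction are taken from the quoted setup.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Half-cosine ramp: 0 at step 0, approaching 1 at the end of warm-up.
        return 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))
    # Linear decay: 1 at the end of warm-up, 0 at the final step.
    remaining = total_steps - step
    return max(0.0, remaining / (total_steps - warmup_steps))


if __name__ == "__main__":
    # Print the multiplier at a few checkpoints across training.
    for s in (0, 2_500, 5_000, 27_500, 50_000):
        print(s, round(lr_multiplier(s), 4))
```

With the quoted 50k steps, warm-up covers the first 5,000 steps and the linear decay spans the remaining 45,000. A different warm-up shape would change only the first branch.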