Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CellPLM: Pre-training of Cell Language Model Beyond Single Cells
Authors: Hongzhi Wen, Wenzhuo Tang, Xinnan Dai, Jiayuan Ding, Wei Jin, Yuying Xie, Jiliang Tang
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | It is evident from our experiments that CellPLM consistently outperforms both pre-trained and non-pre-trained methods across five distinct downstream tasks, with 100 times higher inference speed on generating cell embeddings compared to existing pre-trained models. |
| Researcher Affiliation | Academia | ¹Michigan State University, ²Emory University |
| Pseudocode | No | The paper describes methods in prose and with diagrams (e.g., Figure 2) but does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | The checkpoint of our pre-trained is released on our Github repository, as well as the source codes for fine-tuning and zero-shot experiments. Github link of CellPLM: https://github.com/OmicsML/CellPLM |
| Open Datasets | Yes | All the data we used in this study are publicly available data. The data sources are specified in the appendix. 10x genomics datasets. https://support.10xgenomics.com/single-cell-gene-expression/datasets. |
| Dataset Splits | Yes | Additionally, for methods require model selection on validation set, we performed another 10% simulation dropout and treat masked entries as validation set. |
| Hardware Specification | Yes | the pre-training was finished in less than 24 hours on a GPU server with 8 Nvidia Tesla v100 16GB cards. Table 1: Inference time (s) for querying 48,082 cells on an A100 40GB GPU. |
| Software Dependencies | No | We used inner join by default of Anndata package. We implemented DeepImpute with default settings in DANCE (Ding et al., 2022) package. We utilized R package SAVER to illustrate the performance of it. The paper mentions software packages but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | The hyperparameters, datasets, and reproducibility information for pre-trained models are detailed in Appendix E. Table 5: Hyperparameters for pretraining CellPLM model. |
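
The Dataset Splits entry quotes a validation set built by applying a further 10% simulated dropout and treating the masked entries as held-out targets. The sketch below is one possible reading of that procedure, not the authors' code: the function name `split_validation_mask`, the choice to mask only nonzero entries, the fixed seed, and the toy count matrix are all assumptions made for illustration.

```python
import random


def split_validation_mask(matrix, holdout_frac=0.10, seed=0):
    """Zero out a random fraction of nonzero entries (hypothetical helper).

    Returns the masked copy of the matrix and a dict that maps each
    masked (row, col) position to its original value.
    """
    rng = random.Random(seed)
    # Assumption: only observed (nonzero) entries are candidates for masking.
    nonzero = [
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]
    n_holdout = round(holdout_frac * len(nonzero))
    chosen = rng.sample(nonzero, n_holdout)

    masked = [list(row) for row in matrix]  # copy so the input is untouched
    held_out = {}
    for i, j in chosen:
        held_out[(i, j)] = matrix[i][j]
        masked[i][j] = 0  # simulated dropout
    return masked, held_out


# Toy 4 x 5 count matrix with 20 nonzero entries (illustrative data only).
counts = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [1, 1, 2, 2, 3],
    [3, 4, 5, 6, 7],
]
masked, held_out = split_validation_mask(counts, holdout_frac=0.10, seed=0)
```

A model would then be fitted on `masked` and scored on how well it recovers the values stored in `held_out`; fixing the seed keeps the split repeatable across runs.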