CellPLM: Pre-training of Cell Language Model Beyond Single Cells

Authors: Hongzhi Wen, Wenzhuo Tang, Xinnan Dai, Jiayuan Ding, Wei Jin, Yuying Xie, Jiliang Tang

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | It is evident from our experiments that CellPLM consistently outperforms both pre-trained and non-pre-trained methods across five distinct downstream tasks, with 100 times higher inference speed on generating cell embeddings compared to existing pre-trained models. |
| Researcher Affiliation | Academia | Michigan State University; Emory University |
| Pseudocode | No | The paper describes methods in prose and with diagrams (e.g., Figure 2) but does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | The checkpoint of our pre-trained model is released on our GitHub repository, as well as the source code for fine-tuning and zero-shot experiments. GitHub link: https://github.com/OmicsML/CellPLM |
| Open Datasets | Yes | All the data we used in this study are publicly available; the data sources are specified in the appendix, e.g., the 10x Genomics datasets: https://support.10xgenomics.com/single-cell-gene-expression/datasets |
| Dataset Splits | Yes | Additionally, for methods that require model selection on a validation set, we performed another 10% simulated dropout and treated the masked entries as the validation set. |
| Hardware Specification | Yes | Pre-training was finished in less than 24 hours on a GPU server with 8 NVIDIA Tesla V100 16GB cards. Table 1: Inference time (s) for querying 48,082 cells on an A100 40GB GPU. |
| Software Dependencies | No | We used the inner join by default of the Anndata package; we implemented DeepImpute with default settings in the DANCE package (Ding et al., 2022); we utilized the R package SAVER. The paper names these packages but provides no version numbers for any of them. |
| Experiment Setup | Yes | The hyperparameters, datasets, and reproducibility information for pre-trained models are detailed in Appendix E. Table 5: Hyperparameters for pre-training the CellPLM model. |
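The Open Source Code row points to the released pre-trained checkpoint. A generic PyTorch loading sketch, assuming the checkpoint is a standard `state_dict`-style file; the file name `cellplm.best.ckpt` and the inspection logic are illustrative assumptions, not the repository's actual loading pipeline (see the README at https://github.com/OmicsML/CellPLM for the supported API):

```python
import torch

# Load the released checkpoint onto CPU; a real run would then feed the
# weights into the model class shipped in the CellPLM repository.
# NOTE: the file name below is a placeholder, not the repo's actual artifact.
state = torch.load("cellplm.best.ckpt", map_location="cpu")

# Checkpoints are typically flat dicts of parameter tensors, or wrap one
# under a key such as "state_dict"; inspect before loading into a model.
if isinstance(state, dict) and "state_dict" in state:
    state = state["state_dict"]
print(f"{len(state)} tensors, first key: {next(iter(state))}")
```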
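The Dataset Splits row refers to the masking protocol used for imputation-style evaluation: a further 10% of the observed entries are randomly dropped out, and the masked entries serve as the validation set. A minimal sketch of that protocol, assuming a dense cells-by-genes expression matrix; the function name and exact sampling scheme are illustrative, not the authors' code:

```python
import numpy as np

def simulate_dropout_split(X: np.ndarray, mask_rate: float = 0.1, seed: int = 0):
    """Mask a fraction of the observed (non-zero) entries of a cells-by-genes
    matrix; return the corrupted matrix and the boolean validation mask."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(X)                   # only observed counts can drop out
    n_mask = int(mask_rate * rows.size)
    picked = rng.choice(rows.size, size=n_mask, replace=False)

    val_mask = np.zeros_like(X, dtype=bool)
    val_mask[rows[picked], cols[picked]] = True  # held-out entries for validation

    X_corrupted = X.copy()
    X_corrupted[val_mask] = 0.0                  # simulated dropout
    return X_corrupted, val_mask

# Usage: train/select on X_corrupted, score imputation error only on val_mask.
X = np.random.poisson(1.0, size=(100, 50)).astype(float)
X_corrupted, val_mask = simulate_dropout_split(X, mask_rate=0.1)
```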
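The Hardware Specification row cites Table 1's inference-time measurement for querying 48,082 cells. A minimal sketch of the usual GPU timing pattern (synchronize before reading the clock so queued kernels are counted); the linear layer and input here are stand-ins for the actual embedding model:

```python
import time
import torch

model = torch.nn.Linear(1000, 512)   # stand-in for the cell-embedding model
x = torch.randn(48082, 1000)         # one query of 48,082 "cells"

device = "cuda" if torch.cuda.is_available() else "cpu"
model, x = model.to(device).eval(), x.to(device)

with torch.no_grad():
    if device == "cuda":
        torch.cuda.synchronize()     # flush pending kernels before timing
    t0 = time.perf_counter()
    emb = model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # wait for the forward pass to finish
    print(f"inference: {time.perf_counter() - t0:.3f}s for {emb.shape[0]} cells")
```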
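The Software Dependencies row quotes the paper's use of Anndata's inner join when combining datasets; with `anndata.concat`, an inner join keeps only the genes shared by all inputs. A minimal sketch with toy data (the gene and cell names are placeholders):

```python
import anndata as ad
import numpy as np

# Two toy datasets with partially overlapping gene panels.
a = ad.AnnData(np.ones((3, 3)))
a.var_names = ["g1", "g2", "g3"]
a.obs_names = ["c1", "c2", "c3"]

b = ad.AnnData(np.ones((2, 3)))
b.var_names = ["g2", "g3", "g4"]
b.obs_names = ["c4", "c5"]

# Inner join: keep only the intersection of var_names (here: g2, g3).
combined = ad.concat([a, b], join="inner")
print(combined.var_names.tolist())  # ['g2', 'g3']
```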