LangCell: Language-Cell Pre-training for Cell Identity Understanding

Authors: Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, Zaiqing Nie

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and it also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.
Researcher Affiliation | Collaboration | (1) Institute for AI Industry Research (AIR), Tsinghua University; (2) Department of Computer Science and Technology, Tsinghua University; (3) PharMolix Inc.
Pseudocode | No | The paper describes its methods but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/PharMolix/LangCell.
Open Datasets | Yes | We established scLibrary, a comprehensive dataset comprising roughly 27.5 million pairs of scRNA-seq data and associated textual descriptions. This dataset was sourced from the CELLxGENE (Biology et al., 2023) database, where we acquired scRNA-seq data in raw count matrix format and the corresponding metadata. (An illustrative data-access sketch follows the table.)
Dataset Splits | No | For fine-tuning tasks, all models are trained for the same number of epochs. Cell type annotation uses a training:test split of 2:1, while pathway identification uses a training:test split of 3:7. Users of LangCell can use a small validation set to select the optimal α for a specific task. (A split sketch follows the table.)
Hardware Specification | Yes | The pre-training is conducted on four NVIDIA Tesla A100 GPUs and takes approximately 50 days to complete.
Software Dependencies | No | The training process was conducted using the PyTorch framework and the Hugging Face Transformers library; specific package versions are not reported.
Experiment Setup | Yes | The training process was conducted using the PyTorch framework and the Hugging Face Transformers library. We employed the AdamW optimizer, with the learning rate warmed up to 1e-5 over 1000 steps, followed by a linear decay strategy. Weight decay was set to 0.001. More detailed parameter settings can be found in Appendix C. Table C.0.1 (Experiment Configurations): Vocab size 25427; Hidden size 512; Number of hidden layers 12; Max sequence length 2048; Number of attention heads 8; Dropout 0.02; Hidden act ReLU; LayerNorm eps 1e-12; Max learning rate 1e-5; Warm-up steps 1000; Weight decay 1e-3; Batch size 3; Gradient accumulation 32. (An optimizer/scheduler sketch follows the table.)
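For context on the "Open Datasets" row, the sketch below shows one way to pull raw-count scRNA-seq data plus cell-level metadata from CELLxGENE using the cellxgene-census Python package. This is not the authors' scLibrary construction pipeline; the tissue filter is an arbitrary illustrative assumption.

```python
# Illustrative only: one way to retrieve raw-count scRNA-seq data and metadata
# from CELLxGENE. NOT the authors' scLibrary pipeline; the "lung" filter is an
# arbitrary example.
import cellxgene_census

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter='tissue_general == "lung" and is_primary_data == True',
    )

# adata.X holds the raw count matrix; adata.obs holds the cell-level metadata
# (cell type, tissue, disease, ...) that text descriptions can be built from.
print(adata.shape, list(adata.obs.columns)[:5])
```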
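For the "Dataset Splits" row, the reported ratios translate to test fractions of 1/3 (cell type annotation, 2:1) and 0.7 (pathway identification, 3:7). A minimal sketch, assuming a labelled dataset and scikit-learn; the dataset size, labels, and validation fraction are placeholders:

```python
# Sketch of the reported downstream splits; dataset size, labels, and the
# validation fraction are placeholders, not values from the paper.
import numpy as np
from sklearn.model_selection import train_test_split

n_cells = 10_000                                # placeholder dataset size
indices = np.arange(n_cells)
labels = np.random.randint(0, 10, n_cells)      # placeholder cell-type labels

# Cell type annotation: training:test = 2:1  ->  test fraction = 1/3
ct_train, ct_test = train_test_split(
    indices, test_size=1 / 3, stratify=labels, random_state=0
)

# Pathway identification: training:test = 3:7  ->  test fraction = 0.7
pw_train, pw_test = train_test_split(
    indices, test_size=0.7, stratify=labels, random_state=0
)

# A small validation subset carved from the training data can be used to tune
# the score-fusion weight alpha mentioned in the paper.
ct_train, ct_val = train_test_split(ct_train, test_size=0.1, random_state=0)
```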
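For the "Experiment Setup" row, the reported optimization settings (AdamW, peak learning rate 1e-5, 1000 warm-up steps, linear decay, weight decay 1e-3, batch size 3 with gradient accumulation of 32) roughly correspond to the PyTorch / Transformers configuration below. The model, total step count, and dummy loss are placeholders, not the authors' training code.

```python
# Rough reconstruction of the reported optimization settings; `model`,
# `num_training_steps`, and the dummy loss are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(512, 512)      # placeholder for the LangCell model
num_training_steps = 100_000           # placeholder total optimizer steps

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,            # max learning rate from Table C.0.1
    weight_decay=1e-3,  # weight decay from Table C.0.1
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,             # warm-up steps from Table C.0.1
    num_training_steps=num_training_steps,
)

accumulation_steps = 32   # gradient accumulation from Table C.0.1
for step in range(num_training_steps * accumulation_steps):
    # dummy forward pass with per-device batch size 3
    loss = model(torch.randn(3, 512)).pow(2).mean()
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        scheduler.step()   # scheduler advances once per optimizer update
        optimizer.zero_grad()
```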