LangCell: Language-Cell Pre-training for Cell Identity Understanding
Authors: Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, Zaiqing Nie
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios. |
| Researcher Affiliation | Collaboration | 1) Institute for AI Industry Research (AIR), Tsinghua University; 2) Department of Computer Science and Technology, Tsinghua University; 3) PharMolix Inc. |
| Pseudocode | No | The paper describes its methods but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/PharMolix/LangCell. |
| Open Datasets | Yes | We established scLibrary, a comprehensive dataset comprising roughly 27.5 million pairs of scRNA-seq data and associated textual descriptions. This dataset was sourced from the CELLxGENE (Biology et al., 2023) database, where we acquired scRNA-seq data in raw count matrix format and the corresponding metadata. |
| Dataset Splits | No | For fine-tuning tasks, all models are trained for the same number of epochs. Cell type annotation uses a training:test split of 2:1, while pathway identification uses a training:test split of 3:7. Users of LangCell can use a small validation set to select the optimal α for a specific task. |
| Hardware Specification | Yes | The pre-training is conducted on four NVIDIA Tesla A100 GPUs and takes approximately 50 days to complete. |
| Software Dependencies | No | The training process was conducted using the PyTorch framework and the Hugging Face transformers library (no version numbers are specified). |
| Experiment Setup | Yes | The training process was conducted using the PyTorch framework and the Hugging Face transformers library. We employed the AdamW optimizer, with the learning rate warmed up to 1e-5 over 1000 steps, followed by a linear decay strategy. Weight decay was set to 0.001. More detailed parameter settings can be found in Appendix C. Table C.0.1 (Experiment Configurations): Vocab size 25427, Hidden size 512, Number of hidden layers 12, Max sequence length 2048, Number of attention heads 8, Dropout 0.02, Hidden act ReLU, LayerNorm eps 1e-12, Max learning rate 1e-5, Warmup steps 1000, Weight decay 1e-3, Batch size 3, Gradient accumulation 32. |
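
For context, the optimization schedule quoted above maps onto a short PyTorch sketch: AdamW with a 1000-step linear warmup to a peak learning rate of 1e-5, linear decay afterwards, weight decay 1e-3, and 32-step gradient accumulation over micro-batches of 3. This is a minimal illustration under stated assumptions, not the authors' code; the stand-in model, dummy loss, and total-step count are placeholders not taken from the paper.

```python
# Hypothetical sketch of the reported optimizer/schedule; not the authors' implementation.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

MAX_LR = 1e-5          # peak learning rate (Table C.0.1)
WARMUP_STEPS = 1000    # linear warmup steps
WEIGHT_DECAY = 1e-3    # weight decay
GRAD_ACCUM = 32        # gradient accumulation steps
MICRO_BATCH = 3        # per-step batch size
TOTAL_STEPS = 100_000  # assumed; the paper does not report the total step count

model = torch.nn.Linear(512, 512)  # stand-in for the cell encoder (hidden size 512)
optimizer = AdamW(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step: int) -> float:
    """Linear warmup to MAX_LR over WARMUP_STEPS, then linear decay toward 0."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

for micro_step in range(10 * GRAD_ACCUM):  # a few demo iterations; the real loop runs over scLibrary
    loss = model(torch.randn(MICRO_BATCH, 512)).mean()  # dummy loss on a micro-batch of 3
    (loss / GRAD_ACCUM).backward()                       # average gradients across accumulation steps
    if (micro_step + 1) % GRAD_ACCUM == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```

With gradient accumulation of 32 and a micro-batch of 3, each optimizer step corresponds to an effective batch of 96 samples.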