LangCell: Language-Cell Pre-training for Cell Identity Understanding
Authors: Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, Zaiqing Nie
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios. |
| Researcher Affiliation | Collaboration | 1) Institute for AI Industry Research (AIR), Tsinghua University; 2) Department of Computer Science and Technology, Tsinghua University; 3) PharMolix Inc. |
| Pseudocode | No | The paper describes its methods but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/PharMolix/LangCell. |
| Open Datasets | Yes | We established scLibrary, a comprehensive dataset comprising roughly 27.5 million pairs of scRNA-seq data and associated textual descriptions. This dataset was sourced from the CELLxGENE (Biology et al., 2023) database, where we acquired scRNA-seq data in raw count matrix format and the corresponding metadata. |
| Dataset Splits | No | For fine-tuning tasks, all models are trained for the same number of epochs. Cell type annotation uses a training:test split of 2:1, while pathway identification uses a training:test split of 3:7. Users of LangCell can use a small validation set to select the optimal α for a specific task. |
| Hardware Specification | Yes | The pre-training is conducted on four NVIDIA Tesla A100 GPUs and takes approximately 50 days to complete. |
| Software Dependencies | No | The training process was conducted using the PyTorch framework and the Hugging Face transformers library (no version numbers are specified). |
| Experiment Setup | Yes | The training process was conducted using the PyTorch framework and the Hugging Face transformers library. We employed the AdamW optimizer, with the learning rate warmed up to 1e-5 over 1000 steps, followed by a linear decay strategy. Weight decay was set to 0.001. More detailed parameter settings can be found in Appendix C. Table C.0.1 (Experiment Configurations): Vocab size 25427, Hidden size 512, Number of hidden layers 12, Max sequence length 2048, Number of attention heads 8, Dropout 0.02, Hidden act ReLU, LayerNorm eps 1e-12, Max learning rate 1e-5, Warmup steps 1000, Weight decay 1e-3, Batch size 3, Gradient accumulation 32. |
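
For context, the optimization schedule quoted above maps onto a short PyTorch sketch: AdamW with a 1000-step linear warmup to a peak learning rate of 1e-5, linear decay afterwards, weight decay 1e-3, and 32-step gradient accumulation over micro-batches of 3. This is a minimal illustration under stated assumptions, not the authors' code; the stand-in model, dummy loss, and total-step count are placeholders not taken from the paper.

```python
# Hypothetical sketch of the reported optimizer/schedule; not the authors' implementation.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

MAX_LR = 1e-5          # peak learning rate (Table C.0.1)
WARMUP_STEPS = 1000    # linear warmup steps
WEIGHT_DECAY = 1e-3    # weight decay
GRAD_ACCUM = 32        # gradient accumulation steps
MICRO_BATCH = 3        # per-step batch size
TOTAL_STEPS = 100_000  # assumed; the paper does not report the total step count

model = torch.nn.Linear(512, 512)  # stand-in for the cell encoder (hidden size 512)
optimizer = AdamW(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step: int) -> float:
    """Linear warmup to MAX_LR over WARMUP_STEPS, then linear decay toward 0."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

for micro_step in range(10 * GRAD_ACCUM):  # a few demo iterations; the real loop runs over scLibrary
    loss = model(torch.randn(MICRO_BATCH, 512)).mean()  # dummy loss on a micro-batch of 3
    (loss / GRAD_ACCUM).backward()                       # average gradients across accumulation steps
    if (micro_step + 1) % GRAD_ACCUM == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```

With gradient accumulation of 32 and a micro-batch of 3, each optimizer step corresponds to an effective batch of 96 samples.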