Cell ontology guided transcriptome foundation model

Authors: Xinyu Yuan, Zhihao Zhan, Zuobai Zhang, Manqi Zhou, Jianan Zhao, Boyu Han, Yue Li, Jian Tang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the generalizability and transferability of scCello on 22 million cells from CellxGene. For model generalization, we observe that scCello excels on cell type identification across all datasets in both the zero-shot setting (i.e., directly using the pre-trained model) (Sec. 4.2.1) and the fine-tuning setting (Sec. 4.2.2). In particular, scCello accurately classifies novel cell types by leveraging the ontology graph structure (Sec. 4.3). For transferability, scCello demonstrates competitive performance in predicting cell-type-specific marker genes (Sec. 4.4) and cancer drug responses (Sec. 4.5). Additionally, scCello is robust against batch effects (Sec. 4.6). Finally, we validate our contribution via an ablation study (Sec. 4.7).
Researcher Affiliation | Academia | 1 Mila Québec AI Institute, 2 University of Montréal, 3 McGill University, 4 Cornell University, 5 HEC Montréal, 6 CIFAR AI Chair
Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulas but does not include explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | Source code and model weights are available at https://github.com/DeepGraphLearning/scCello.
Open Datasets | Yes | We pre-trained scCello on 22 million cells from the CellxGene database, leveraging their cell-type labels mapped to the cell ontology graph from the Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over existing TFMs on biologically important tasks, including identifying novel cell types of unseen cells, predicting cell-type-specific marker genes, and cancer drug responses. Source code and model weights are available at https://github.com/DeepGraphLearning/scCello. The scRNA-seq data were downloaded from CellxGene.
Dataset Splits | Yes | We fine-tuned TFMs on a subset of our curated pre-training data, randomly selecting 90% for training and using the remaining 10% for validation.
Hardware Specification | Yes | An Adam optimizer [38] (learning rate: 0.001, weight decay: 0.001, warm-up steps: 3,333) was used to train scCello for 40,000 steps on 4 NVIDIA A100 GPUs on Compute Canada.
Software Dependencies | No | The paper mentions software such as the Adam optimizer, Scanpy, the Louvain algorithm, and RAPIDS, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | An Adam optimizer [38] (learning rate: 0.001, weight decay: 0.001, warm-up steps: 3,333) was used to train scCello for 40,000 steps on 4 NVIDIA A100 GPUs on Compute Canada. We used a batch size of 192. More details are introduced in App. D.
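The Dataset Splits and Experiment Setup rows above can be sketched in code. This is a minimal pure-Python illustration, not the authors' implementation (which uses PyTorch and Adam): the linear warmup shape, the split seed, and the function names are assumptions; the paper only states the hyperparameter values (learning rate 0.001, weight decay 0.001, 3,333 warm-up steps, 40,000 steps, batch size 192, 90/10 split).

```python
import random

# Hyperparameters quoted from the paper's experiment setup.
LR = 1e-3
WEIGHT_DECAY = 1e-3
WARMUP_STEPS = 3_333
TOTAL_STEPS = 40_000
BATCH_SIZE = 192


def train_val_split(cell_ids, val_frac=0.1, seed=0):
    """Random 90/10 split as described for fine-tuning.

    The seed and exact shuffling strategy are assumptions; the paper
    only says the 10% validation subset is selected randomly.
    """
    rng = random.Random(seed)
    ids = list(cell_ids)
    rng.shuffle(ids)
    n_val = int(len(ids) * val_frac)
    return ids[n_val:], ids[:n_val]  # (train, validation)


def warmup_lr(step, base_lr=LR, warmup_steps=WARMUP_STEPS):
    """Learning rate with linear warmup, then constant.

    The linear shape is an assumption: the paper states only the
    number of warm-up steps, not the schedule's functional form.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

In a real training loop these values would be passed to an Adam optimizer (e.g. `torch.optim.Adam(params, lr=LR, weight_decay=WEIGHT_DECAY)`) with the schedule applied per step for `TOTAL_STEPS` iterations.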