Cell2Sentence: Teaching Large Language Models the Language of Biology

Authors: Daniel Levine, Syed A Rizvi, Sacha Lévy, Nazreen Pallikkavaliyaveetil, David Zhang, Xingyu Chen, Sina Ghadermarzi, Ruiming Wu, Zihe Zheng, Ivan Vrkic, Anna Zhong, Daphne Raskin, Insu Han, Antonio Henrique De Oliveira Fonseca, Josue Ortega Caro, Amin Karbasi, Rahul Madhav Dhodapkar, David Van Dijk

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs, and accurately predict cell types from cell sentences.
Researcher Affiliation | Collaboration | 1 Department of Computer Science, Yale University, New Haven, CT, USA ... 6 Google ... 9 Roski Eye Institute, University of Southern California, Los Angeles, CA, USA
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | We plan to open-source our software and cell sentence datasets.
Open Datasets | Yes | We focus our experiments on three datasets with extensive natural language metadata and labels, allowing us to leverage the capabilities of base models. Immune tissue (Domínguez Conde et al., 2022) ... Cytokine stimulation (Dong et al., 2023) ... Multi-tissue (Megill et al., 2021) ... L1000 (Subramanian et al., 2017) and GTEx (Consortium, 2020)
Dataset Splits | Yes | We hold out 20% of cell sentences for validation (10%) and testing (10%). (See the split sketch below the table.)
Hardware Specification | Yes | Even on a p4d.24xlarge AWS instance with 8 A100 40GB GPUs, half-precision, and flash attention 2, we found it difficult to fit longer sequences without memory issues.
Software Dependencies | No | The paper mentions software tools and libraries such as Hugging Face, Scanpy, Pythia-160m, the AdamW optimizer, and Flash Attention, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We employ a learning rate of 6 × 10⁻⁴ with a cosine scheduler and a 1% warmup ratio. For the GPT-2 medium model, we accumulate gradients over 16 steps. The effective batch sizes for the small and medium models are 10 and 48 examples, respectively. (See the training-configuration sketch below the table.)
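The Dataset Splits row reports an 80/10/10 split of cell sentences. A minimal sketch of such a split, assuming the sentences are held in a Python list named `cell_sentences` and using scikit-learn; neither the variable name nor the library is specified by the paper.

```python
# Minimal sketch of the reported 80/10/10 split of cell sentences.
# `cell_sentences` and the use of scikit-learn are assumptions for illustration,
# not the authors' released code.
from sklearn.model_selection import train_test_split

def split_cell_sentences(cell_sentences, seed=0):
    # Hold out 20% of the cell sentences ...
    train, heldout = train_test_split(cell_sentences, test_size=0.2, random_state=seed)
    # ... and divide the held-out portion evenly into validation (10%) and test (10%).
    val, test = train_test_split(heldout, test_size=0.5, random_state=seed)
    return train, val, test
```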
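The Experiment Setup row quotes the fine-tuning hyperparameters. Below is a hedged sketch of how they might be expressed as Hugging Face `TrainingArguments` for the GPT-2 medium model; the output directory, per-device batch size, and fp16 flag are assumptions (the paper reports only the effective batch size, and half precision is mentioned in the hardware context), while the learning rate, scheduler, warmup ratio, and accumulation steps come from the quote above.

```python
# Hedged sketch of a Hugging Face TrainingArguments configuration mirroring the
# reported hyperparameters for GPT-2 medium. Values marked "assumption" are not
# stated in the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="c2s_gpt2_medium",        # assumption: illustrative path
    learning_rate=6e-4,                  # reported learning rate
    lr_scheduler_type="cosine",          # cosine scheduler
    warmup_ratio=0.01,                   # 1% warmup ratio
    gradient_accumulation_steps=16,      # reported for the GPT-2 medium model
    per_device_train_batch_size=3,       # assumption; see note below
    fp16=True,                           # half precision, per the hardware row
)
```

In this framing the effective batch size is per_device_train_batch_size × number of devices × gradient_accumulation_steps; 3 × 1 × 16 matches the reported 48 on a single device, but the actual device layout used for training is not stated.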