Cell2Sentence: Teaching Large Language Models the Language of Biology
Authors: Daniel Levine, Syed A Rizvi, Sacha Lévy, Nazreen Pallikkavaliyaveetil, David Zhang, Xingyu Chen, Sina Ghadermarzi, Ruiming Wu, Zihe Zheng, Ivan Vrkic, Anna Zhong, Daphne Raskin, Insu Han, Antonio Henrique De Oliveira Fonseca, Josue Ortega Caro, Amin Karbasi, Rahul Madhav Dhodapkar, David Van Dijk
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs, and accurately predict cell types from cell sentences. |
| Researcher Affiliation | Collaboration | Department of Computer Science, Yale University, New Haven, CT, USA ... Google ... Roski Eye Institute, University of Southern California, Los Angeles, CA, USA |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | We plan to open-source our software and cell sentence datasets. |
| Open Datasets | Yes | We focus our experiments on three datasets with extensive natural language metadata and labels, allowing us to leverage the capabilities of base models. Immune tissue (Domínguez Conde et al., 2022) ... Cytokine stimulation (Dong et al., 2023) ... Multi-tissue (Megill et al., 2021) ... L1000 (Subramanian et al., 2017) and GTEx (Consortium, 2020) |
| Dataset Splits | Yes (see the split sketch below the table) | We hold out 20% of cell sentences for validation (10%) and testing (10%). |
| Hardware Specification | Yes | Even on a p4d.24xlarge AWS instance with 8 A100 40GB GPUs, half-precision, and flash attention 2, we found it difficult to fit longer sequences without memory issues. |
| Software Dependencies | No | The paper mentions software tools and libraries like Hugging Face, Scanpy, Pythia-160m, AdamW optimizer, and Flash Attention, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes (see the training-configuration sketch below the table) | We employ a learning rate of 6 × 10⁻⁴ with a cosine scheduler and 1% warmup ratio. For the GPT-2 medium model, we accumulate gradients over 16 steps. The effective batch sizes for the small and medium models are 10 and 48 examples, respectively. |
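The 80/10/10 hold-out quoted in the Dataset Splits row could be reproduced along the lines of the sketch below. This is a minimal illustration, not the authors' released code: the function name `split_cell_sentences`, the seed, and the use of scikit-learn's `train_test_split` are assumptions.

```python
# Minimal sketch of the described split: 80% train, 10% validation, 10% test.
# Assumes cell sentences are provided as a list of strings; seed and library choice are illustrative.
from sklearn.model_selection import train_test_split

def split_cell_sentences(cell_sentences, seed=42):
    """Hold out 20% of cell sentences, half for validation and half for testing."""
    train, holdout = train_test_split(cell_sentences, test_size=0.2, random_state=seed)
    val, test = train_test_split(holdout, test_size=0.5, random_state=seed)
    return train, val, test
```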
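The hyperparameters quoted in the Experiment Setup row map naturally onto Hugging Face `TrainingArguments`, consistent with the tooling the paper mentions (Hugging Face, AdamW, Flash Attention). The sketch below mirrors the quoted values for GPT-2 medium (6 × 10⁻⁴ learning rate, cosine schedule, 1% warmup, 16 gradient-accumulation steps); the per-device batch size and output directory are assumptions chosen so that a single-GPU effective batch size matches the quoted 48, and fp16 follows the half-precision setting in the Hardware Specification row.

```python
# Hypothetical training configuration mirroring the quoted hyperparameters.
# Only learning rate, scheduler, warmup ratio, and gradient accumulation come from the paper;
# the remaining fields are assumptions added to make the example runnable.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="c2s-gpt2-medium",    # assumed output path
    learning_rate=6e-4,              # 6 × 10⁻⁴, as quoted
    lr_scheduler_type="cosine",      # cosine scheduler
    warmup_ratio=0.01,               # 1% warmup
    gradient_accumulation_steps=16,  # GPT-2 medium setting
    per_device_train_batch_size=3,   # assumed single-GPU value; 3 × 16 = effective batch of 48
    fp16=True,                       # half precision, per the hardware description
)
```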