Multi-modal Transfer Learning between Biological Foundation Models
Authors: Juan Jose Garau-Luis, Patrick Bordes, Liam Gonzalez, Maša Roller, Bernardo de Almeida, Christopher Blum, Lorenz Hexemer, Stefan Laurent, Jan Grzegorzewski, Maren Lang, Thomas Pierrot, Guillaume Richard
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate its capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e. same DNA sequence) and map to different transcript expression levels across various human tissues. We show that our model, dubbed Isoformer, is able to accurately predict differential transcript expression, outperforming existing methods and leveraging the use of multiple modalities. We performed ablation studies to validate our different architectural choices. |
| Researcher Affiliation | Industry | Juan Jose Garau-Luis (InstaDeep), Patrick Bordes (InstaDeep), Liam Gonzalez (InstaDeep), Maša Roller (InstaDeep), Bernardo P. de Almeida (InstaDeep), Lorenz Hexemer (BioNTech), Christopher Blum (BioNTech), Stefan Laurent (BioNTech), Jan Grzegorzewski (BioNTech), Maren Lang (BioNTech), Thomas Pierrot (InstaDeep), Guillaume Richard (InstaDeep) |
| Pseudocode | No | The paper describes the model architecture and procedures in prose and with diagrams (Figure 1, 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our model, paving the way for new multi-modal gene expression approaches. We make the weights of this Isoformer model available on Hugging Face: https://huggingface.co/InstaDeepAI/isoformer (see the loading sketch after the table). |
| Open Datasets | Yes | We conducted our analysis of Isoformer on RNA transcript expression data obtained from the GTEx portal. We based our dataset on the Genotype-Tissue Expression (GTEx) portal; specifically, we use the 8th release of the Transcript TPMs table: https://www.gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression (see the data-loading sketch after the table). |
| Dataset Splits | Yes | Our dataset has a fixed train and test set, divided by genes; all presented results correspond to the performance on the test set. We used the Adam optimizer with a learning rate of 3×10⁻⁵ and batch size of 64, and used early stopping on a validation set comprised of 5% of the train set to reduce training time (see the gene-level split sketch after the table). |
| Hardware Specification | Yes | All experiments were carried out with 5 seeds on 4 A100 GPUs (80GB RAM). |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We used the Adam optimizer with a learning rate of 3×10⁻⁵ and batch size of 64, and used early stopping on a validation set comprised of 5% of the train set to reduce training time. We provide model hyper-parameters in Table 6 and the hyper-parameters of the encoders in Table 7. (See the training-setup sketch after the table.) |
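Since the Isoformer weights are published on Hugging Face, they can presumably be pulled with the standard `transformers` loading API. A minimal sketch, assuming the repository ships a custom model class exposed via `trust_remote_code`; the exact entry point and tokenizer behavior are assumptions, not confirmed by the paper:

```python
# Minimal sketch: loading the open-sourced Isoformer weights from Hugging Face.
# Assumption: the repo exposes custom model/tokenizer code via trust_remote_code;
# the exact API surface is not documented in the paper itself.
from transformers import AutoModel, AutoTokenizer

repo_id = "InstaDeepAI/isoformer"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode for downstream expression prediction
```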
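The GTEx v8 Transcript TPMs table is distributed as a gzipped GCT file from the download page linked above. A minimal pandas sketch for reading it; the filename follows the public v8 release naming, but treat it as an assumption and check the portal for the exact artifact:

```python
# Minimal sketch: reading the GTEx v8 transcript TPM table (GCT format).
# GCT files carry two metadata lines before the tab-separated header row.
# The filename is an assumption based on the GTEx v8 download page.
import pandas as pd

path = "GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_tpm.gct.gz"
tpm = pd.read_csv(path, sep="\t", skiprows=2)
# Expected layout: a transcript_id column, a gene_id column, then one TPM
# column per GTEx sample.
print(tpm.shape)
```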
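Because the train/test split is by gene, all isoforms of a gene must land on the same side of the split. A minimal sketch with scikit-learn's `GroupShuffleSplit`, continuing from the `tpm` frame above; the `gene_id` column name and the test fraction are assumptions (the paper states the split is fixed but not its proportions):

```python
# Minimal sketch: gene-level train/test split plus the 5% validation carve-out
# used for early stopping. Grouping by gene_id keeps all transcript isoforms
# of a gene on the same side of each split. Column name and test fraction are
# assumptions, not taken from the paper.
from sklearn.model_selection import GroupShuffleSplit

def gene_level_split(df, test_size, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["gene_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

train_df, test_df = gene_level_split(tpm, test_size=0.2)       # illustrative fraction
train_df, val_df = gene_level_split(train_df, test_size=0.05)  # 5% validation, as reported
```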
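The reported optimization settings (Adam, learning rate 3×10⁻⁵, batch size 64, early stopping on the validation split) map onto a standard PyTorch loop. A minimal self-contained sketch with placeholder model and data; the loss function and patience value are assumptions:

```python
# Minimal sketch of the reported training setup: Adam optimizer at lr 3e-5,
# batch size 64, early stopping on a validation split. The linear model, dummy
# tensors, loss, and patience are placeholders, not the paper's Isoformer setup.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(128, 1)  # placeholder standing in for Isoformer
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)

train_ds = TensorDataset(torch.randn(640, 128), torch.randn(640, 1))  # dummy data
val_ds = TensorDataset(torch.randn(64, 128), torch.randn(64, 1))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64)

best_val, patience, bad_epochs = float("inf"), 3, 0  # patience is an assumption
for epoch in range(100):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # stop once validation loss plateaus
```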