How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval

Authors: Philip Fradkin, Puria Azadi Moghadam, Karush Suri, Frederik Wenkel, Ali Bashashati, Maciej Sypetkowski, Dominique Beaini

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate improved multi-modal learned retrieval through (1) a uni-modal pre-trained phenomics model, (2) a novel inter-sample similarity aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1x improvement in zero-shot molecular retrieval of active molecules over the previous state-of-the-art, reaching 77.33% in top-1% accuracy.
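The top-1% accuracy quoted above can be sketched as a retrieval check: a query's true match must rank within the top 1% of all candidates by similarity. The function below is an illustrative NumPy sketch (the function name and the convention that the diagonal holds ground-truth pairs are assumptions, not the paper's code).

```python
import numpy as np

def topk_percent_accuracy(sim, k_percent=1.0):
    """Fraction of queries whose ground-truth match (assumed on the
    diagonal of the query-by-candidate similarity matrix) ranks within
    the top k% of candidates."""
    n = sim.shape[0]
    k = max(1, int(np.ceil(n * k_percent / 100.0)))
    diag = sim[np.arange(n), np.arange(n)][:, None]
    # rank = number of candidates scored strictly higher than the true match
    ranks = (sim > diag).sum(axis=1)
    return float((ranks < k).mean())
```

With an identity similarity matrix every query retrieves its own match first, so the accuracy is 1.0 at any k.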
Researcher Affiliation Collaboration Philip Fradkin1,2, Puria Azadi1,3, Karush Suri1, Frederik Wenkel1, Ali Bashashati3, Maciej Sypetkowski1, Dominique Beaini1,4 — 1 Valence Labs, 2 University of Toronto, Vector Institute, 3 University of British Columbia, 4 Université de Montréal, Mila Quebec AI Institute
Pseudocode Yes Algorithm 1 S2L loss pseudo-implementation.
# mol_emb : molecule model embedding [n, dim]
# phn_emb : phenomics model embedding [n, dim]
# t_prime, b : learnable temperature and bias
# n : mini-batch size
# sim : custom similarity function
# γ, ζ : similarity dampening parameters
t = exp(t_prime)
zmol = l2_normalize(mol_emb)
zphn = l2_normalize(phn_emb)
logits = dot(zmol, zphn.T) * t + b
sim_matrix = sim(zphn, zphn.T)  # [n, n] sample similarity matrix
pos = log_sigmoid(logits)
neg = log_sigmoid(-logits)
l = sim_matrix * pos + (γ - ζ * sim_matrix) * neg
l = -sum(l) / n
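A minimal runnable sketch of Algorithm 1 in NumPy is below. It substitutes a plain dot product where the paper's custom similarity function would go (the original symbol was lost in extraction), and uses the γ, ζ defaults from Table 7; names and defaults are assumptions, not the authors' implementation.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def s2l_loss(mol_emb, phn_emb, t_prime, b, gamma=1.7, zeta=0.75):
    """Sketch of the S2L loss from Algorithm 1. A dot product on the
    normalized phenomics embeddings stands in for the paper's custom
    similarity function."""
    n = mol_emb.shape[0]
    t = np.exp(t_prime)
    zmol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    zphn = phn_emb / np.linalg.norm(phn_emb, axis=1, keepdims=True)
    logits = zmol @ zphn.T * t + b
    sim_matrix = zphn @ zphn.T          # [n, n] sample similarity
    pos = log_sigmoid(logits)
    neg = log_sigmoid(-logits)
    l = sim_matrix * pos + (gamma - zeta * sim_matrix) * neg
    return float(-l.sum() / n)
```

Dampening the negative term by the inter-sample similarity is what distinguishes S2L from a plain sigmoid (SigLIP-style) pairwise loss: highly similar non-paired samples are penalized less as negatives.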
Open Source Code No As part of the submission, we are unable to provide code to reproduce model training due to its proprietary nature. The training dataset is also an asset of a private institution, meaning that it cannot be made publicly accessible.
Open Datasets Yes Our evaluation is performed on a publicly accessible dataset RxRx3, allowing for benchmarking of other methods. [...] Evaluation set 3 Unseen Dataset: Finally, we utilize the RxRx3 dataset [16], an open-source out-of-distribution (OOD) dataset consisting of 6,549 novel molecule and concentration pairs associated with phenomic experiments.
Dataset Splits Yes Our training dataset comprises 1,316,283 pairs of molecule and concentration combinations, complemented by fluorescent microscopy images generated through over 2,150,000 phenomic experiments. [...] Evaluation set 1 Unseen Images + Seen Molecules: The first set consists of unseen images and seen molecules. [...] Evaluation set 2 Unseen Images + Unseen Molecules: The second set includes previously unseen molecules and images (consisting of 45,771 molecule and concentration pairs). Predicting molecular identities of previously unseen molecular perturbations corresponds to zero-shot prediction. Scaffold splitting was used to split this validation dataset from training, ensuring minimal information leakage.
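Scaffold splitting assigns all molecules sharing a chemical scaffold to the same side of the split, so structurally similar compounds never leak from train to validation. A minimal pure-Python sketch is below; it assumes scaffold identifiers are already computed per molecule (in practice, e.g., via RDKit's Murcko scaffolds), and the function name is illustrative.

```python
import random

def scaffold_split(scaffold_by_mol, val_frac=0.1, seed=0):
    """Group molecules by scaffold and assign whole scaffolds to train
    or validation, so no scaffold appears on both sides of the split.

    scaffold_by_mol: dict mapping molecule id -> scaffold id.
    """
    groups = {}
    for mol, scaf in scaffold_by_mol.items():
        groups.setdefault(scaf, []).append(mol)
    scaffolds = sorted(groups)
    random.Random(seed).shuffle(scaffolds)
    target = val_frac * len(scaffold_by_mol)
    train, val, count = [], [], 0
    for scaf in scaffolds:
        if count < target:
            val.extend(groups[scaf])
            count += len(groups[scaf])
        else:
            train.extend(groups[scaf])
    return train, val
```

Because assignment happens at the scaffold level, the realized validation fraction only approximates `val_frac` when scaffold groups are large.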
Hardware Specification Yes We utilized an NVIDIA A100 GPU to train MolPhenix using Phenom1 and MolGPS embeddings, which takes approximately 4.75 hours each. For loss comparison experiments, we run each model using 3 different seeds and 8 different losses, resulting in a total of 114 hours of GPU processing time. For concentration experiments we train 7 runs, one for each concentration, with 3 seeds each, totaling 21 runs per set of parameters. With 25 sets of parameters evaluated (13), that amounts to 2,500 A100 compute hours. Moreover, we employed 8 NVIDIA A100 GPUs to train the CLOOME model on phenomics images, with an average of 40 hours of usage per run.
Software Dependencies No The paper does not provide specific software dependencies with version numbers. It mentions RDKit and molecular fingerprints such as MACCS and Morgan, but without version information.
Experiment Setup Yes Table 7: Hyperparameter values utilized in our proposed MolPhenix training framework for the MolGPS version. For the non-MolGPS version, γ = 2.75 and ζ = 1.0.
number of seeds: 3
learning rate: 1e-3
weight decay: 3e-3
optimizer: AdamW
training batch size: 8192
validation batch size: 12000
embedding dim: 512
model size: medium (38.7M)
model structure: 6 ResNet blocks + 1 linear layer + 1 ResNet block + 1 linear layer
epochs: 100
self-similarity clip value: 0.75
learnable temperature initialization: 2.302
learnable bias initialization: -1.0
distance function: arctangent of L2 distance
γ: 1.7
ζ: 0.75
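For reference, the Table 7 values can be collected into a single config fragment. This is a hypothetical dict mirroring the reported hyperparameters; the key names are illustrative, not the authors' code.

```python
# Hypothetical config mirroring Table 7 (MolGPS version).
# Key names are illustrative; values are as reported in the paper.
MOLPHENIX_CONFIG = {
    "num_seeds": 3,
    "learning_rate": 1e-3,
    "weight_decay": 3e-3,
    "optimizer": "AdamW",
    "train_batch_size": 8192,
    "val_batch_size": 12000,
    "embedding_dim": 512,
    "model_params": 38.7e6,        # "medium" model size
    "epochs": 100,
    "self_similarity_clip": 0.75,
    "temperature_init": 2.302,     # t_prime; t = exp(t_prime) ≈ 10
    "bias_init": -1.0,
    "distance_fn": "arctan_l2",    # arctangent of L2 distance
    "gamma": 1.7,                  # 2.75 for the non-MolGPS version
    "zeta": 0.75,                  # 1.0 for the non-MolGPS version
}
```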