SALSA: Semantically-Aware Latent Space Autoencoder

Authors: Kathryn E. Kirchoff, Travis Maxfield, Alexander Tropsha, Shawn M. Gomez

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, each listing the assessed result and the supporting LLM response:

Research Type: Experimental
LLM Response: "We evaluate semantic awareness of SALSA representations by comparing to its ablated counterparts, and show empirically that SALSA learns representations that maintain 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity."

Researcher Affiliation: Academia
LLM Response: "1. Department of Computer Science, UNC Chapel Hill; 2. Eshelman School of Pharmacy, UNC Chapel Hill; 3. Department of Pharmacology, UNC Chapel Hill; 4. Joint Department of Biomedical Engineering at UNC Chapel Hill and NC State University"

Pseudocode: No
LLM Response: "The paper does not contain any structured pseudocode or algorithm blocks."

Open Source Code: No
LLM Response: "The paper does not provide any specific links to a code repository or an explicit statement about the release of source code for the methodology."

Open Datasets: Yes
LLM Response: "We utilize the dataset developed by Popova, Isayev, and Tropsha (2018), which contains approximately 1.5 million SMILES sequences sourced from the ChEMBL database (version ChEMBL21), a chemical database comprised of drug-like or otherwise biologically relevant molecular compounds (Bento et al. 2014)."

Dataset Splits: No
LLM Response: "The paper describes a 'Training Dataset' and several 'Evaluation Sets' (Supermutant Evaluation Set, RDKit Virtual Screening Benchmark), but does not explicitly define or mention a separate 'validation' split for hyperparameter tuning during training."

Hardware Specification: No
LLM Response: "The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments."

Software Dependencies: Yes
LLM Response: "We normalize all SMILES using the RDKit canonicalization algorithm (RDKit 2023)."
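Since no code is released, this normalization step can only be approximated. A minimal sketch using the public RDKit API (the helper name canonicalize is ours, not the authors'):

```python
from rdkit import Chem

def canonicalize(smiles: str):
    """Return RDKit's canonical SMILES, or None if the string fails to parse."""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol, canonical=True)

# Kekulized benzene normalizes to RDKit's aromatic canonical form.
assert canonicalize("C1=CC=CC=C1") == "c1ccccc1"
```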
Experiment Setup: Yes
LLM Response: "We use l = 8 layers for both the encoder and the decoder with a hidden dimension of size h = 512, and m = 8 heads in the multi-head attention blocks. Our main results are of models trained with S = 32 latent dimensions, although we also investigated reduced latent dimensions, S ∈ {16, 8, 4, 2}. For the contrastive loss, we set temperature τ = 0.7, following Khosla et al. (2020). The final loss is a weighted combination of the two terms, L = λL_c + (1 − λ)L_r (Eq. 5), where 0 ≤ λ ≤ 1 is a hyperparameter weighting the contributions of the contrastive loss and the reconstruction loss, respectively. We train SALSA with λ = 0.5, and make comparisons to either ablation, λ = 1 and λ = 0, described later in Experiments and Analysis."
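With no reference implementation linked, the loss setup above can only be sketched. The PyTorch rendering below is a minimal, non-authoritative approximation: supcon_loss follows the general form of the supervised contrastive loss of Khosla et al. (2020) with τ = 0.7, and salsa_loss is the weighted combination of Eq. 5 with λ = 0.5. The function names, the label-based grouping of positives, and the handling of anchors without positives are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Supervised contrastive loss in the spirit of Khosla et al. (2020).

    z:      (N, d) latent embeddings from the encoder
    labels: (N,) ids grouping each anchor with its positives
            (our assumption about how positives are identified)
    """
    z = F.normalize(z, dim=1)                        # unit vectors: dot product = cosine similarity
    sim = z @ z.t() / tau                            # temperature-scaled similarity matrix
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)  # avoid -inf * 0 = NaN below
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # mean log-probability of positives per anchor, averaged over anchors;
    # anchors with no positives contribute zero (clamp avoids division by zero)
    per_anchor = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return per_anchor.mean()

def salsa_loss(l_c: torch.Tensor, l_r: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Eq. 5: L = lam * L_c + (1 - lam) * L_r, with 0 <= lam <= 1."""
    return lam * l_c + (1 - lam) * l_r
```

Note that the two ablations the paper compares against fall out of this form directly: λ = 1 drops the reconstruction term (contrastive-only), and λ = 0 drops the contrastive term, reducing the model to a plain autoencoder.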