Protein Discovery with Discrete Walk-Jump Sampling

Authors: Nathan C. Frey, Dan Berenberg, Karina Zadorozhny, Joseph Kleinhenz, Julien Lafrance-Vanasse, Isidro Hotzel, Yan Wu, Stephen Ra, Richard Bonneau, Kyunghyun Cho, Andreas Loukas, Vladimir Gligorijevic, Saeed Saremi

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We validate our method with in vitro experiments.
Researcher Affiliation | Collaboration | 1. Prescient Design, Genentech; 2. Antibody Engineering, Genentech; 3. Department of Computer Science, New York University; 4. Center for Data Science, New York University
Pseudocode | Yes | Algorithm 1: Discrete Walk-Jump Sampling (a minimal sketch of the sampler appears below the table)
Open Source Code | Yes | https://github.com/prescient-design/walk-jump
Open Datasets | Yes | Sequences from the Observed Antibody Space (OAS) database (Olsen et al., 2022) are aligned according to the AHo numbering scheme (Honegger & Plückthun, 2001) using the ANARCI (Dunbar & Deane, 2016) package and one-hot encoded (see the encoding sketch below the table). Our model is trained only on the publicly available (Mason et al., 2021) dataset.
Dataset Splits | Yes | To avoid overfitting the estimator, we split the reference set into a fitting set and a validation set (Algo. 2). Sequence property metrics are condensed into a single scalar metric by computing the distributional conformity score and the normalized average Wasserstein distance W_property between the property distributions of samples and a validation set (a sketch of this distance appears below the table).
Hardware Specification | No | The paper mentions 'GPU time / sample' and 'GPU memory (MB)' in Table 7 but does not provide specific details on the hardware used, such as exact GPU or CPU models.
Software Dependencies | Yes | All models were trained with the AdamW (Loshchilov & Hutter, 2017) optimizer in PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | We used a batch size of 256, an initial learning rate of 1 × 10^-4, and trained with early stopping (a minimal training-loop sketch follows the table).
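
The 'Pseudocode' row refers to Algorithm 1, discrete walk-jump sampling. The following is a minimal PyTorch sketch of the walk-jump idea only, not the authors' implementation: a Langevin MCMC "walk" in the noise-smoothed continuous space using a learned score, followed by a single denoising "jump" and a per-position argmax to recover a discrete sequence. The `score_fn` and `denoiser` callables, the noise level `sigma`, the step size, and the step count are placeholders.

```python
import torch

def discrete_walk_jump_sample(score_fn, denoiser, shape, sigma=0.5,
                              n_steps=100, step_size=1e-2):
    """Sketch of walk-jump sampling: Langevin walk on smoothed data, one denoising jump."""
    # Start from Gaussian noise at the smoothing scale (initialization is an assumption).
    y = sigma * torch.randn(shape)

    # "Walk": overdamped Langevin MCMC in the smoothed (continuous) space.
    for _ in range(n_steps):
        y = y + 0.5 * step_size * score_fn(y) + (step_size ** 0.5) * torch.randn_like(y)

    # "Jump": single denoising step back toward clean one-hot space.
    x_hat = denoiser(y)

    # Discretize: argmax over the amino-acid alphabet at each position.
    return x_hat.argmax(dim=-1)

# Toy usage with stand-in networks (a real score and denoiser would come from training).
dummy_score = lambda y: -y       # score of a standard Gaussian, for illustration only
dummy_denoiser = lambda y: y     # identity "denoiser", for illustration only
samples = discrete_walk_jump_sample(dummy_score, dummy_denoiser, (4, 10, 21))
print(samples.shape)             # torch.Size([4, 10])
```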
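
For the 'Open Datasets' row, the described preprocessing (AHo alignment via ANARCI, then one-hot encoding) can be illustrated as below. The alignment step is omitted and a pre-aligned string is taken as input; the 21-token alphabet (20 amino acids plus a gap character) and the helper name `one_hot_encode` are assumptions, not the repository's exact code.

```python
import numpy as np

# Assumed alphabet: 20 amino acids plus '-' for alignment gaps introduced by AHo numbering.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_TO_IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(aligned_seq: str) -> np.ndarray:
    """One-hot encode an AHo-aligned antibody sequence, gaps included."""
    x = np.zeros((len(aligned_seq), len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(aligned_seq):
        x[pos, AA_TO_IDX[aa]] = 1.0
    return x

print(one_hot_encode("EVQLV-ESGG-").shape)  # (11, 21)
```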
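
For the 'Dataset Splits' row, a normalized average Wasserstein distance between sample and validation property distributions could be computed roughly as follows. The per-property normalization by the reference standard deviation and the dictionary-of-arrays interface are assumptions rather than the paper's exact definition of W_property.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def normalized_avg_wasserstein(sample_props: dict, reference_props: dict) -> float:
    """Average 1-D Wasserstein distance over named sequence properties, scale-normalized."""
    distances = []
    for name, ref_values in reference_props.items():
        scale = np.std(ref_values) or 1.0  # assumed normalization so properties are comparable
        distances.append(wasserstein_distance(sample_props[name], ref_values) / scale)
    return float(np.mean(distances))

# Toy usage with two synthetic property distributions (e.g. charge and hydrophobicity).
rng = np.random.default_rng(0)
ref = {"charge": rng.normal(0, 1, 500), "hydrophobicity": rng.normal(0, 2, 500)}
gen = {"charge": rng.normal(0.1, 1, 500), "hydrophobicity": rng.normal(0.2, 2, 500)}
print(normalized_avg_wasserstein(gen, ref))
```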
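
For the 'Software Dependencies' and 'Experiment Setup' rows, the reported configuration (AdamW in PyTorch, batch size 256, initial learning rate 1e-4, early stopping) might look roughly like the self-contained sketch below. The toy model architecture, the denoising objective, the noise level, the early-stopping patience, and the synthetic data are all assumptions made for illustration.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for one-hot encoded antibody sequences (length 10, 21-token alphabet).
data = torch.nn.functional.one_hot(torch.randint(0, 21, (4096, 10)), 21).float()
train_loader = DataLoader(TensorDataset(data[:3584]), batch_size=256, shuffle=True)
val_loader = DataLoader(TensorDataset(data[3584:]), batch_size=256)

# Hypothetical denoiser: predicts the clean sequence from a noise-corrupted copy.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(10 * 21, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10 * 21),
)
optimizer = AdamW(model.parameters(), lr=1e-4)  # AdamW, initial learning rate 1e-4
sigma = 0.5                                     # smoothing noise level (assumed value)

def denoising_loss(batch):
    (x,) = batch
    y = x + sigma * torch.randn_like(x)         # corrupt with Gaussian noise
    return torch.nn.functional.mse_loss(model(y).view_as(x), x)

best_val, patience, bad_epochs = float("inf"), 5, 0  # patience value is an assumption
for epoch in range(100):
    model.train()
    for batch in train_loader:                  # batches of size 256
        optimizer.zero_grad()
        denoising_loss(batch).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val = sum(denoising_loss(b).item() for b in val_loader) / len(val_loader)
    if val < best_val:
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                               # early stopping
```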