Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction
Authors: Zuobai Zhang, Minghao Xu, Aurelie C. Lozano, Vijil Chenthamarakshan, Payel Das, Jian Tang
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. |
| Researcher Affiliation | Collaboration | Zuobai Zhang1,2, Minghao Xu1,2, Aurélie Lozano3, Vijil Chenthamarakshan3, Payel Das3, Jian Tang1,4,5 (*equal contribution, corresponding author) 1Mila Québec AI Institute, 2Université de Montréal, 3IBM Research, 4HEC Montréal, 5CIFAR AI Chair. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 SiamDiff Pre-Training |
| Open Source Code | No | Code will be released upon acceptance. |
| Open Datasets | Yes | Following Zhang et al. [87], we pre-train our models with the AlphaFold protein structure database v1 [44, 70], including 365K proteome-wide predicted structures. |
| Dataset Splits | Yes | The EC task involves 538 binary classification problems... We use dataset splits from Gligorijević et al. [24] with a 95% sequence identity cutoff. The ATOM3D tasks include Protein Interface Prediction (PIP), Mutation Stability Prediction (MSP), Residue Identity (RES), and Protein Structure Ranking (PSR) with different dataset splits based on sequence identity or competition year. |
| Hardware Specification | Yes | All methods are pre-trained on 4 Tesla A100 GPUs and Table 5 reports the batch sizes on each GPU. All residue-level tasks are run on 4 V100 GPUs while all atom-level tasks are run on A100 GPUs. |
| Software Dependencies | No | All these methods are developed based on PyTorch and TorchDrug [88]. (No version numbers provided for PyTorch or TorchDrug.) |
| Experiment Setup | Yes | In DiffPreT, for structure diffusion, we use a sigmoid schedule for variances βt with the lowest variance β1 = 1e-4 and the highest variance βT = 0.1. For sequence diffusion, we simply set the cumulative transition probability to [MASK] over time steps as a linear interpolation between minimum mask rate 0.15 and maximum mask rate 1.0. The number of diffusion steps is set as 100. In SiamDiff, we adopt the same hyperparameters for multimodal diffusion models. We set the variance of torsional perturbation noises as 0.1π on the atom level and that of Gaussian perturbation noises as 0.3 on the residue level when constructing the correlated conformer. (Tables 5 and 6 also provide specific batch sizes, optimizers, and learning rates.) |
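To make the quoted hyperparameters concrete, below is a minimal PyTorch sketch of the schedules and perturbation noises described in the Experiment Setup row. This is an illustration only, not the authors' code (which was not released at submission time): the sigmoid grid endpoints, tensor shapes, and all variable names are our own assumptions.

```python
import math
import torch

# A minimal sketch, assuming generic PyTorch; the sigmoid endpoints (-6, 6)
# and the placeholder shapes below are assumptions, not taken from the paper.

T = 100                     # number of diffusion steps (from the paper)
beta_1, beta_T = 1e-4, 0.1  # lowest / highest structure-diffusion variances

# Sigmoid schedule for the structure-diffusion variances beta_t: squash an
# evenly spaced grid through a sigmoid, then rescale to [beta_1, beta_T].
s = torch.sigmoid(torch.linspace(-6.0, 6.0, T))
betas = beta_1 + (beta_T - beta_1) * (s - s[0]) / (s[-1] - s[0])

# Sequence diffusion: cumulative probability of transitioning to [MASK],
# linearly interpolated between the 0.15 minimum and 1.0 maximum mask rates.
mask_rates = torch.linspace(0.15, 1.0, T)

# Correlated-conformer perturbations quoted for SiamDiff: torsional noise with
# variance 0.1*pi at the atom level, Gaussian noise with variance 0.3 at the
# residue level (standard normal samples scaled by the square root of each).
residue_coords = torch.randn(128, 3)  # hypothetical residue coordinates
sibling_coords = residue_coords + math.sqrt(0.3) * torch.randn_like(residue_coords)
torsion_noise = math.sqrt(0.1 * math.pi) * torch.randn(128)  # per-angle noise
```

Rescaling the sigmoid output to exactly span [beta_1, beta_T] is one common way to realize a "sigmoid schedule"; the paper does not specify its grid endpoints, so any implementation matching the quoted β1 and βT endpoints is consistent with the description.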