Protein Representation Learning by Geometric Structure Pretraining
Authors: Zuobai Zhang, Minghao Xu, Arian Rokkum Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. |
| Researcher Affiliation | Collaboration | Mila - Québec AI Institute, Université de Montréal, University of Cambridge, IBM Research, HEC Montréal, CIFAR AI Chair; {zuobai.zhang, minghao.xu}@mila.quebec, arj39@cam.ac.uk, {ecvijil, aclozano, daspa}@us.ibm.com, jian.tang@hec.ca |
| Pseudocode | No | The paper includes Table 1, which summarizes the self-prediction methods, and Figure 5, which illustrates them; neither is a structured pseudocode block or algorithm. |
| Open Source Code | Yes | Our implementation is available at https://github.com/DeepGraphLearning/GearNet. |
| Open Datasets | Yes | We use the AlphaFold protein structure database (CC-BY 4.0 License) (Varadi et al., 2021) for pretraining. This database contains protein structures predicted by AlphaFold2, and we employ both 365K proteome-wide predictions and 440K Swiss-Prot (Consortium, 2021) predictions. |
| Dataset Splits | Yes | For EC and GO prediction, we follow the multi-cutoff split methods in Gligorijević et al. (2021) to ensure that the test set only contains PDB chains with sequence identity no more than 95% to the training set as used in Wang et al. (2022b) (see Appendix F for results at lower identity cutoffs). For fold classification, Hou et al. (2018) provides three different test sets: Fold, in which proteins from the same superfamily are unseen during training; Superfamily, in which proteins from the same family are not present during training; and Family, in which proteins from the same family are present during training. For reaction classification, we adopt dataset splits proposed in Hermosilla et al. (2021), where proteins have less than 50% sequence similarity in-between splits. |
| Hardware Specification | Yes | All these models are trained on 4 Tesla A100 GPUs (see Appendix E.3). |
| Software Dependencies | No | The paper mentions using 'TorchDrug (Zhu et al., 2022)' for the GCN implementation, but it does not specify version numbers for TorchDrug or any other major software libraries or frameworks used (e.g., PyTorch, TensorFlow, specific Python version). |
| Experiment Setup | Yes | Table 5: Hyperparameter configurations of our model on different datasets. The batch size reported in the table refers to the batch size on each GPU. All the hyperparameters are chosen by the performance on the validation set. For pretraining, we use the Adam optimizer with learning rate 0.001 and train a model for 50 epochs... For Multiview Contrast, we set the cropping length of the subsequence operation as 50, the radius of the subspace operation as 15, the mask rate of random edge masking operation as 0.15. The temperature τ in the InfoNCE loss function is set as 0.07. When pretraining GearNet-Edge and GearNet-Edge-IEConv, we use 96 and 24 as batch sizes, respectively. (A hedged sketch of this pretraining configuration follows the table.) |
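For readers who want to sanity-check the pretraining setup quoted in the Experiment Setup row, the snippet below is a minimal PyTorch-style sketch, not the authors' TorchDrug implementation: it collects the reported hyperparameters into an illustrative dictionary (the key names are hypothetical) and shows one common way to compute an InfoNCE contrastive loss between two augmented views with the quoted temperature τ = 0.07.

```python
import torch
import torch.nn.functional as F

# Hyperparameters quoted in the paper's setup; the dictionary keys are
# illustrative and do not correspond to the authors' actual config files.
pretrain_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "epochs": 50,
    "subsequence_crop_length": 50,   # Multiview Contrast: subsequence cropping
    "subspace_radius": 15,           # Multiview Contrast: subspace cropping
    "edge_mask_rate": 0.15,          # Multiview Contrast: random edge masking
    "temperature": 0.07,             # InfoNCE temperature tau
    "batch_size_per_gpu": {"GearNet-Edge": 96, "GearNet-Edge-IEConv": 24},
}

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE loss between two batches of view embeddings.

    z1, z2: (batch_size, dim) embeddings of two augmented views of the same
    proteins; row i of z1 and row i of z2 form a positive pair, and all other
    rows in the batch serve as in-batch negatives.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # (B, B) cosine similarities / tau
    labels = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    # Toy usage with random embeddings standing in for GearNet-Edge outputs.
    z1, z2 = torch.randn(8, 512), torch.randn(8, 512)
    print(info_nce_loss(z1, z2, pretrain_config["temperature"]).item())
```

This one-directional formulation is a common simplification; the paper's Multiview Contrast objective may symmetrize the loss or draw negatives differently, so treat the sketch only as an illustration of the quoted temperature and batch settings.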