Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

Authors: Jiale Zhao, Wanru Zhuang, Jia Song, Yaqi Li, Shuqi Lu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on binding site prediction and function prediction tasks demonstrate our proposed pre-training approach significantly outperforms other methods.
Researcher Affiliation | Collaboration | (1) DP Technology, Beijing, China; (2) Institute of Computing Technology, UCAS, Beijing, China; (3) Xiamen University, School of Informatics, Xiamen, China; (4) Xiamen University, Institute of Artificial Intelligence, Xiamen, China. Correspondence to: Shuqi Lu <lusq@dp.tech>.
Pseudocode | No | The paper describes its methods in narrative text and figures (Figure 2 shows an overview of the architecture) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Our code will be made public.
Open Datasets | Yes | The pre-training dataset is constructed from the Protein Data Bank and structures predicted by AlphaFold (Jumper et al., 2021). To obtain high-quality data, structures from the Protein Data Bank (Rose et al., 2016) with a resolution greater than 9 Å are filtered out, and structures from the AlphaFold2 Database with a pLDDT lower than 70 are filtered out. Additionally, MMseqs2 (Mirdita et al., 2021) is utilized to cluster the dataset. For EC and GO prediction, the same datasets as prior work (Zhang et al., 2022b) are used. Datasets for DNA and RNA binding are downloaded from the BioLiP and GraphBind websites. (A sketch of these filtering criteria appears after the table.)
Dataset Splits | Yes | To construct the validation set, MMseqs2 was used to cluster the training set so that the sequence similarity between the validation set and the training set is lower than 40%. The dataset comprises a total of 22,995 samples for training and 1,006 samples for validation. (A clustering sketch using MMseqs2 appears after the table.)
Hardware Specification | Yes | All training experiments are conducted with 16 Tesla A100 GPUs. For the downstream task, small molecule binding site prediction is chosen; all inference experiments are conducted with 8 V100 GPUs on the COACH420 test set (409 samples).
Software Dependencies | No | The paper mentions software tools such as ESM, FreeSASA, MMseqs2, EquiBind, and DiffDock, but does not provide specific version numbers for these or for other software dependencies such as programming languages or libraries.
Experiment Setup | Yes | Table 10 provides model hyperparameters for the different tasks, including layers (12), node dim. (256/768), edge dim. (128), FFN dim. (512/768), batch size (32/64), kNN neighbors (90/30), total steps (50K/200K/100K), warmup steps (5K), learning rate (1e-5/5e-5), and optimizer (Adam). Additionally, the paper states: 'In all downstream tasks, we introduce random Gaussian noise at a scale of 0.5 Å to 20% of the residues in order to prevent overfitting and enhance the robustness of our model.' (A noise-augmentation sketch appears after the table.)
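The quality filters quoted under Open Datasets (drop PDB structures with resolution worse than 9 Å and AlphaFold structures with mean pLDDT below 70) amount to a simple predicate over per-chain metadata. The sketch below only illustrates those criteria; the ChainRecord type and its fields (source, resolution, mean_plddt) are hypothetical placeholders, not the authors' released pipeline.

import dataclasses
from typing import List

@dataclasses.dataclass
class ChainRecord:
    chain_id: str
    source: str            # "pdb" or "afdb" (hypothetical source labels)
    resolution: float = 0.0  # crystallographic resolution in Å (PDB entries)
    mean_plddt: float = 0.0  # mean predicted lDDT (AlphaFold entries)

def passes_quality_filter(rec: ChainRecord) -> bool:
    # Keep PDB structures at 9 Å resolution or better and AlphaFold
    # structures with mean pLDDT of at least 70, as stated in the paper.
    if rec.source == "pdb":
        return rec.resolution <= 9.0
    if rec.source == "afdb":
        return rec.mean_plddt >= 70.0
    return False

def filter_chains(records: List[ChainRecord]) -> List[ChainRecord]:
    return [r for r in records if passes_quality_filter(r)]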
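One common way to realize the <40% cross-split sequence identity reported under Dataset Splits is MMseqs2's easy-cluster workflow with a 0.4 minimum-identity threshold. The call below is a generic sketch; the input FASTA, output prefix, and temporary directory are placeholder paths, not the authors' exact invocation.

import subprocess

# Cluster all chain sequences at a 40% minimum sequence identity.
subprocess.run(
    [
        "mmseqs", "easy-cluster",
        "all_chains.fasta",     # input sequences (placeholder path)
        "cluster_res",          # output prefix (writes cluster_res_cluster.tsv, etc.)
        "tmp",                  # temporary working directory
        "--min-seq-id", "0.4",  # 40% identity threshold
    ],
    check=True,
)

Assigning entire clusters, rather than individual sequences, to the training or validation split keeps the sequence similarity between the two splits below 40%.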
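The augmentation quoted under Experiment Setup adds 0.5 Å Gaussian noise to 20% of residues in downstream tasks. The NumPy sketch below illustrates such a perturbation; the function name, the array layout (one coordinate per residue, e.g. the C-alpha atom), and the use of NumPy are assumptions for illustration, not the authors' implementation.

from typing import Optional
import numpy as np

def perturb_residues(coords: np.ndarray,
                     fraction: float = 0.2,
                     noise_scale: float = 0.5,
                     rng: Optional[np.random.Generator] = None) -> np.ndarray:
    # coords: (num_residues, 3) array of residue coordinates in Å.
    rng = rng or np.random.default_rng()
    noisy = coords.copy()
    num_residues = coords.shape[0]
    num_noised = max(1, int(round(fraction * num_residues)))
    # Pick a random 20% subset of residues and add isotropic Gaussian noise
    # with standard deviation noise_scale (0.5 Å).
    idx = rng.choice(num_residues, size=num_noised, replace=False)
    noisy[idx] += rng.normal(loc=0.0, scale=noise_scale, size=(num_noised, 3))
    return noisy

For example, calling perturb_residues(ca_coords) would jitter roughly one in five residues by about half an angstrom before the structure is passed to the downstream model.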