Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning

Authors: Youhan Lee, Hasun Yu, Jaemyung Lee, Jaehoon Kim

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our results demonstrate that our approach can enhance performance in various downstream tasks, thereby underscoring the importance of including surface attributes in protein representation learning." (Abstract); "Table 2: Performance on downstream tasks." (Section 5.3); "Ablation study on 3D latent embedding: In Protein INR, we incorporate a 3D convolution layer to introduce a spatial inductive bias to the latent space. To evaluate the effect of the approach, we conduct an analysis of the learning curve of INR when incorporating or excluding spatial inductive bias." (Section 5.3). A hedged sketch of this 3D latent design follows the table.
Researcher Affiliation | Industry | "Youhan Lee, Hasun Yu, Jaemyung Lee, Jaehoon Kim. Kakao Brain. {youhan.lee,shawn.yu,james.brain,jack.brain}@kakaobrain.com"
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper; methods are described in prose and through diagrams.
Open Source Code | No | The paper states that it uses official code for existing methods (e.g., "The DSPoint and KPConv, and GearNet are implemented from their official codes." and "we use the well-published TorchDrug framework (Zhu et al., 2022)."), but provides no link to, or explicit availability statement for, its own Protein INR implementation or the code specific to this paper's methodology. A hedged TorchDrug usage sketch follows the table.
Open Datasets | Yes | "To pre-train structural information, we utilize AlphaFold Protein Structure Database version 2 (Varadi et al., 2022) to pre-train the models. We use protein structure prediction data for 20 species and Swiss-Prot (Boeckmann et al., 2003)." (Section 5.1)
Dataset Splits | Yes | Table 6 ("The number of datasets for downstream tasks", Appendix A.3) reports train/validation/test splits: Enzyme Commission 15,170/1,686/1,860; Gene Ontology 28,305/3,139/3,148; Fold Classification 12,312/736/718.
Hardware Specification | Yes | "We use batch size as 16 per step (8 A100 GPUs and 2 for each GPU) for all experiments." (Section 5.2); "We use 64 NVIDIA A100 80GB GPUs for pre-training." (Appendix A.1.2)
Software Dependencies | No | The paper mentions "the well-published TorchDrug framework (Zhu et al., 2022)" and that "The DSPoint and KPConv, and GearNet are implemented from their official codes," but pins no version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We train Protein INR in 50 epochs with learning rate of 1e-4." (Section 5.1); "The model is trained for 50 epochs on EC, 200 epochs on GO, and 300 epochs on fold classification task." (Section 5.2); "We use batch size as 16 per step (8 A100 GPUs and 2 for each GPU) for all experiments." (Section 5.2); "Table 4 presents the hyperparameters used in pre-training of structural data for ESM-GearNet-IEConv and GearNet-IEConv." (Appendix A.1.2)
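
The ablation quoted in the Research Type row turns on a single design choice: arranging the INR latent space as a 3D grid and passing it through a 3D convolution so that neighboring latent cells are coupled. The paper ships no code, so the following is a minimal PyTorch sketch under assumed shapes; the class name SpatialLatent3D, the grid size, and the trilinear lookup are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialLatent3D(nn.Module):
    """Illustrative sketch: a latent code arranged as a 3D grid, with a
    Conv3d mixing neighboring cells (the 'spatial inductive bias' that
    the paper's ablation toggles on and off)."""

    def __init__(self, latent_dim: int = 64, grid_size: int = 8):
        super().__init__()
        self.latent_dim = latent_dim
        # One learnable grid stands in here; in an INR setting a grid
        # would typically be fit (or predicted) per protein.
        self.latent = nn.Parameter(
            torch.randn(1, latent_dim, grid_size, grid_size, grid_size)
        )
        self.conv = nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) query points in [-1, 1]^3, e.g. surface points.
        grid = self.conv(self.latent)      # (1, C, D, H, W)
        pts = coords.view(1, -1, 1, 1, 3)  # grid_sample expects (B, d, h, w, 3)
        feats = F.grid_sample(grid, pts, align_corners=True)  # (1, C, N, 1, 1)
        return feats.view(self.latent_dim, -1).t()            # (N, C)

model = SpatialLatent3D()
codes = model(torch.rand(16, 3) * 2 - 1)  # latent codes for 16 query points
print(codes.shape)  # torch.Size([16, 64])
```

Reading it against the quoted ablation: using self.latent directly, without self.conv, would correspond to the "excluding spatial inductive bias" arm of the comparison.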
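
On the downstream side, the Open Source Code and Software Dependencies rows point to the TorchDrug framework and the official GearNet code. Below is a minimal sketch of standing up a GearNet encoder through TorchDrug's public torchdrug.models.GearNet class; the hidden sizes and relation count follow GearNet's commonly published configuration, and the optimizer line only echoes the reported learning rate and batch size, none of which is confirmed as this paper's exact setup.

```python
# Minimal sketch, assuming TorchDrug's public GearNet interface
# (torchdrug.models.GearNet). Hyperparameters follow GearNet's commonly
# published configuration, not values confirmed by this paper.
import torch
from torchdrug import models

encoder = models.GearNet(
    input_dim=21,            # one-hot residue types
    hidden_dims=[512] * 6,   # six relational message-passing layers
    num_relation=7,          # edge/relation types on the residue graph
    batch_norm=True,
    short_cut=True,
    readout="sum",
)

# Echoing the reported schedule: learning rate 1e-4 (quoted for Protein INR
# pre-training, used here only as a placeholder) and an effective batch size
# of 16 per step, i.e. 8 A100 GPUs x 2 samples each. Adam is an assumption.
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
```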