Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PScalpel: A Machine Learning-based Guider for Protein Phase-Separating Behaviour Alteration

Authors: Jia Wang, Liyan Zhu, Zhe Wang, Chenqiu Zhang, Yaoxing Wu, Jun Cui, Jianqiang Li

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive computational and biological experiments validate the effectiveness of PScalpel as a versatile tool for guiding alterations in protein phase separation behavior. (Evidence drawn from the paper's Experimental Results section.)
Researcher Affiliation Academia 1College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China 2MOE Key Laboratory of Gene Function and Regulation, Guangdong Province Key Laboratory of Pharmaceutical Functional Genes, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China 3Guangdong Province Key Laboratory of Pharmaceutical Functional Genes, The First Affiliated Hospital of Sun Yat-sen University, School of Life Sciences, Sun Yat-sen University, Guangzhou, China 4National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes methods verbally and uses diagrams (Figure 1, Figure 2, Figure 3) to illustrate workflows, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes https://github.com/zly20020208/PScalpel
Open Datasets Yes To train BetaFold, the predicted human protein structures from AlphaFold2 in the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/) are utilized. The test dataset is composed of native protein structures from the RCSB PDB database (released on 07/20/2022) (Berman et al. 2000). We obtained a test dataset containing 34 proteins from the 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14). For T3GCL, two auxiliary datasets, the clinical mutation data and the phase separation data, are downloaded from the ClinVar website (https://www.ncbi.nlm.nih.gov/clinvar/) and from LLPSDB (Li et al. 2020; Ning et al. 2020) and PDB (Berman et al. 2000), respectively.
Dataset Splits Yes In total, 23,391 human protein sequences and their corresponding 0-1 matrices are obtained, of which 21,051 are used as the training set and 2,340 as the validation set. The phase separation data are then divided into two independent parts: 70% is used as the training set for T3GCL and 30% as the validation set. To train TLPSDM and evaluate its performance in predicting the phase separation capabilities of single proteins, we utilized mutated amino acid sequences from the cGAS and TDP43 proteins as distinct single-protein datasets, performing five-fold cross-validation on each.
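The splits quoted above can be sketched in a few lines of plain Python. The dataset contents and seed below are placeholders (the paper does not describe its shuffling or seeding), but the split sizes match the reported 21,051/2,340 hold-out, the 70/30 T3GCL split, and the five-fold cross-validation:

```python
import random

def split(items, n_val, seed=0):
    """Shuffle and hold out n_val items for validation (seed is an assumption)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items[n_val:], items[:n_val]

def k_folds(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    fold = len(items) // k
    for i in range(k):
        test = items[i * fold:(i + 1) * fold] if i < k - 1 else items[(k - 1) * fold:]
        train = [x for x in items if x not in test]  # O(n^2), fine for a sketch
        yield train, test

# 23,391 human protein sequences -> 21,051 train / 2,340 validation.
human = [f"seq_{i}" for i in range(23391)]
train, val = split(human, n_val=2340)

# 70/30 split of the phase separation data for T3GCL (size here is a placeholder).
ps_data = [f"ps_{i}" for i in range(1000)]
ps_train, ps_val = split(ps_data, n_val=len(ps_data) * 3 // 10)

# Five-fold CV on a single-protein mutant set, as done for cGAS and TDP43.
mutants = [f"mut_{i}" for i in range(200)]
for tr, te in k_folds(mutants, k=5):
    assert len(tr) + len(te) == len(mutants)
```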
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running its experiments.
Software Dependencies No The paper mentions software components like ProtVec, AlphaFold2, and the Adam optimizer but does not specify version numbers for these or other key libraries/frameworks used in the implementation.
Experiment Setup Yes The feature extraction layer contains N = 6 identical blocks. Each block is a residual structural unit (He et al. 2016) whose dimension is d_model = 128. For the self-attention layer, we set h = 8, and for each head the dimension is set as d_k = d_v = d_model/h = 16. Only one hidden layer of dimension 64 is added to the feed-forward neural network, and Squared ReLU activation (So et al. 2021) is used to reduce the computational load. The learning rate and dropout rate are set to 0.001 and 0.5 respectively, and we use Adam (Kingma and Ba 2014) as the gradient descent algorithm. The dropout rate is set to 0 and SeLU activation (Klambauer et al. 2017) is used to guarantee high sensitivity of the model to site changes in protein sequences. A fully connected feed-forward network is then utilized to map the generated feature vectors to the corresponding categories, with two hidden layers of dimensions 64 and 16 respectively. In the experimental setup, the number of mutation sites in each sample is limited to a small positive integer k; in this study, k is set to 2.
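The hyperparameters quoted above can be gathered into a single configuration sketch. All field and class names are illustrative (the paper does not release a configuration file), and the zero-dropout/SeLU setting, which the paper quotes for a separate site-sensitive component, is recorded only as a comment:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureExtractorConfig:
    """Hyperparameters as quoted in the paper; names are illustrative."""
    n_blocks: int = 6             # N = 6 identical residual blocks (He et al. 2016)
    d_model: int = 128            # dimension of each residual block
    n_heads: int = 8              # self-attention heads, h = 8
    ffn_hidden: int = 64          # single FFN hidden layer, Squared ReLU activation
    learning_rate: float = 1e-3   # Adam optimizer
    dropout: float = 0.5          # set to 0 in the site-sensitive variant (SeLU)
    clf_hidden: tuple = (64, 16)  # classifier feed-forward hidden dimensions
    max_mutation_sites: int = 2   # k = 2 mutation sites per sample

    @property
    def d_head(self) -> int:
        # Per-head dimension: d_k = d_v = d_model / h
        return self.d_model // self.n_heads

cfg = FeatureExtractorConfig()
assert cfg.d_head == 16  # matches d_k = d_v = 16 quoted in the paper
```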