RNA Secondary Structure Prediction By Learning Unrolled Algorithms
Authors: Xinshi Chen, Yu Li, Ramzan Umarov, Xin Gao, Le Song
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to compare E2Efold with state-of-the-art (SOTA) methods on several RNA benchmark datasets, showing the superior performance of E2Efold: it predicts valid RNA secondary structures including pseudoknots; runs as efficiently as the fastest algorithm in terms of inference time; produces structures that are visually close to the true structures; and outperforms the previous SOTA in terms of F1 score, precision and recall (these metrics are sketched after the table). From Section 6 (Experiments): We compare E2Efold with the SOTA and also the most commonly used methods in the RNA secondary structure prediction field on two benchmark datasets. |
| Researcher Affiliation | Collaboration | Xinshi Chen1, Yu Li2, Ramzan Umarov2, Xin Gao2, Le Song1,3 (1 Georgia Tech, 2 KAUST, 3 Ant Financial) |
| Pseudocode | Yes | Algorithm 1: Post-Processing Network PPφ(U, M); Algorithm 2: Neural Cell PPcellφ. (A hedged sketch of the unrolled post-processing loop follows the table.) |
| Open Source Code | Yes | The codes for reproducing the experimental results are released at https://github.com/ml4bio/e2efold. |
| Open Datasets | Yes | We use two benchmark datasets: (i) Archive II (Sloma & Mathews, 2016), containing 3975 RNA structures from 10 RNA types, is a widely used benchmark dataset for classical RNA folding methods. (ii) RNAStralign (Tan et al., 2017), composed of 37149 structures from 8 RNA types, is one of the most comprehensive collections of RNA structures in the market. |
| Dataset Splits | Yes | We divide the RNAStralign dataset into training, testing and validation sets by stratified sampling (see details in Table 7 and Fig 6), so that each set contains all RNA types. Table 7 (RNAStralign dataset split statistics) reports per-RNA-type counts in columns All, Training, Validation and Testing. (A stratified-split sketch follows the table.) |
| Hardware Specification | Yes | The batch size was set to fully use the GPU memory, which was 20 for the Titan Xp card. |
| Software Dependencies | No | We used PyTorch to implement the whole package of E2Efold. (No version number is specified for PyTorch or any other software.) |
| Experiment Setup | Yes | In the deep score network, we used a hyper-parameter, d, set to 10 in the final model, to control the model capacity. In the transformer encoder layers, we set the number of heads to 2, the dimension of the feed-forward network to 2048, and the dropout rate to 0.1. As for the position encoding, we used 58 basis functions to form the position feature map... In the PP network, we initialized w as 1, s as log(9), α as 0.01, β as 0.1, γ_α as 0.99, γ_β as 0.99, and ρ as 1. We set T as 20. ... we used a weighted loss and set the positive sample weight to 300. The batch size was set to fully use the GPU memory, which was 20 for the Titan Xp card. ... we set the batch size to 8. However, we updated the model's parameters every 30 steps to stabilize the training process. We pre-trained the score network for 100 epochs. As for the fine-tuning... we fine-tuned the whole model for 20 epochs. (A training-loop sketch with these settings follows the table.) |
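
The evaluation quote above compares methods by F1 score, precision and recall over predicted base pairs. As a reading aid, here is a minimal, hedged sketch of how these metrics are commonly computed for secondary structures represented as sets of (i, j) pairs; the paper may apply additional tolerance rules (e.g. allowing one-position shifts), which this sketch omits, and the function name `pair_f1` is ours.

```python
def pair_f1(pred_pairs, true_pairs):
    """Precision, recall and F1 over base pairs treated as sets of (i, j) tuples."""
    pred, true = set(pred_pairs), set(true_pairs)
    tp = len(pred & true)                         # correctly predicted base pairs
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: both true pairs recovered, plus one spurious pair.
print(pair_f1([(1, 10), (2, 9), (3, 8)], [(1, 10), (2, 9)]))
```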
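The pseudocode row cites Algorithm 1, the post-processing network PPφ(U, M) that unrolls a constrained-optimization solver for T iterations. Below is a minimal PyTorch sketch of that idea, assuming a projected-gradient/dual-ascent style update. The initial values w = 1, s = log(9), α = 0.01, β = 0.1, γ_α = γ_β = 0.99, ρ = 1 and T = 20 are taken from the setup row, but the update equations themselves are our illustrative assumptions, not the paper's exact Algorithm 1.

```python
import math
import torch
import torch.nn as nn

class UnrolledPostProcess(nn.Module):
    """Hedged sketch of an unrolled post-processing network in the spirit of
    Algorithm 1 (PPφ(U, M)). Hyper-parameter initializations follow the quoted
    setup; the update rules below are illustrative assumptions."""

    def __init__(self, T=20, alpha=0.01, beta=0.1, gamma=0.99, rho=1.0):
        super().__init__()
        self.T = T                                    # unrolled iterations (T = 20)
        self.w = nn.Parameter(torch.tensor(1.0))      # learnable score scale, init 1
        self.s = nn.Parameter(torch.tensor(math.log(9.0)))  # learnable offset, init log(9)
        self.alpha, self.beta, self.gamma, self.rho = alpha, beta, gamma, rho

    def forward(self, U, M):
        # U: (L, L) raw pairing scores; M: (L, L) mask of chemically valid pairs.
        A = torch.sigmoid(self.w * U - self.s) * M    # soft-thresholded, masked scores
        lam = torch.zeros(U.size(0), device=U.device) # dual variables, one per base
        a, b = self.alpha, self.beta
        for _ in range(self.T):
            A = 0.5 * (A + A.t())                     # keep the pairing matrix symmetric
            excess = torch.relu(A.sum(dim=1) - 1.0)   # violation of "<= 1 partner per base"
            # Primal step: reward high scores, penalize constraint-violating rows.
            grad = self.w * U - (lam + self.rho * excess).unsqueeze(1)
            A = torch.clamp(A + a * grad, 0.0, 1.0) * M
            lam = lam + b * excess                    # dual ascent on the multipliers
            a *= self.gamma                           # step-size decay, γ_α = 0.99
            b *= self.gamma                           # step-size decay, γ_β = 0.99
        return 0.5 * (A + A.t())
```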
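The dataset-splits row reports stratified sampling by RNA type. A minimal sketch with scikit-learn's train_test_split is below; the 80/10/10 ratio and the `stratified_split` helper are our assumptions, since the quoted text fixes the stratification variable but not the exact proportions.

```python
from sklearn.model_selection import train_test_split

def stratified_split(sequences, rna_types, seed=0):
    """Split sequences 80/10/10 while preserving RNA-type proportions (assumed ratio)."""
    # Carve out 20% for validation + test, keeping per-type proportions.
    train_seqs, rest_seqs, _, rest_y = train_test_split(
        sequences, rna_types, test_size=0.2, stratify=rna_types, random_state=seed)
    # Split the remainder evenly into validation and test, again stratified.
    val_seqs, test_seqs, _, _ = train_test_split(
        rest_seqs, rest_y, test_size=0.5, stratify=rest_y, random_state=seed)
    return train_seqs, val_seqs, test_seqs
```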
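The setup row fixes several training details: 2 attention heads, feed-forward dimension 2048 and dropout 0.1 in the transformer encoder; a weighted loss with positive-sample weight 300; and fine-tuning with batch size 8 while updating parameters every 30 steps, i.e. gradient accumulation. The PyTorch sketch below wires those quoted numbers together; the layer count, the choice d_model = d = 10, the toy random data and the Adam optimizer are our assumptions.

```python
import torch
import torch.nn as nn

# Encoder with the quoted hyper-parameters: 2 heads, FFN dim 2048, dropout 0.1.
# Using d = 10 as d_model and 3 layers are assumptions; the quote fixes only d.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=10, nhead=2, dim_feedforward=2048, dropout=0.1, batch_first=True)
model = nn.Sequential(nn.TransformerEncoder(encoder_layer, num_layers=3),
                      nn.Linear(10, 1))

# Weighted loss with the quoted positive-sample weight of 300: true base pairs
# are rare in the contact matrix, so an unweighted loss would favor "no pair".
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(300.0))
optimizer = torch.optim.Adam(model.parameters())  # optimizer choice assumed

# Batch size 8 with a parameter update every 30 steps (gradient accumulation);
# the random batches below stand in for real RNA sequence features.
accum_steps = 30
for step in range(60):
    x = torch.randn(8, 50, 10)                       # (batch, seq length, d)
    target = (torch.rand(8, 50, 1) < 0.01).float()   # sparse positive labels
    loss = criterion(model(x), target) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```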