RNA Secondary Structure Prediction By Learning Unrolled Algorithms

Authors: Xinshi Chen, Yu Li, Ramzan Umarov, Xin Gao, Le Song

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to compare E2Efold with state-of-the-art (SOTA) methods on several RNA benchmark datasets, showing superior performance of E2Efold including: being able to predict valid RNA secondary structures including pseudoknots; running as efficiently as the fastest algorithm in terms of inference time; producing structures that are visually close to the true structure; better than previous SOTA in terms of F1 score, precision and recall. From Section 6 (Experiments): We compare E2Efold with the SOTA and also the most commonly used methods in the RNA secondary structure prediction field on two benchmark datasets.
Researcher Affiliation | Collaboration | Xinshi Chen (1), Yu Li (2), Ramzan Umarov (2), Xin Gao (2), Le Song (1,3); affiliations: (1) Georgia Tech, (2) KAUST, (3) Ant Financial
Pseudocode | Yes | Algorithm 1: Post-Processing Network PPφ(U, M); Algorithm 2: Neural Cell PPcellφ (an illustrative sketch of such an unrolled cell appears below the table)
Open Source Code | Yes | The codes for reproducing the experimental results are released at https://github.com/ml4bio/e2efold.
Open Datasets | Yes | We use two benchmark datasets: (i) Archive II (Sloma & Mathews, 2016), containing 3975 RNA structures from 10 RNA types, is a widely used benchmark dataset for classical RNA folding methods. (ii) RNAStralign (Tan et al., 2017), composed of 37149 structures from 8 RNA types, is one of the most comprehensive collections of RNA structures in the market.
Dataset Splits | Yes | We divide RNAStralign dataset into training, testing and validation sets by stratified sampling (see details in Table 7 and Fig 6), so that each set contains all RNA types. Table 7 (RNAStralign dataset split statistics) lists the per-RNA-type counts for the All, Training, Validation and Testing columns. (A sketch of such a stratified split appears below the table.)
Hardware Specification | Yes | The batch size was set to fully use the GPU memory, which was 20 for the Titan Xp card.
Software Dependencies | No | We used Pytorch to implement the whole package of E2Efold. (No version number is specified for PyTorch or any other software.)
Experiment Setup | Yes | In the deep score network, we used a hyper-parameter, d, which was set as 10 in the final model, to control the model capacity. In the transformer encoder layers, we set the number of heads as 2, the dimension of the feed-forward network as 2048, and the dropout rate as 0.1. As for the position encoding, we used 58 base functions to form the position feature map... In the PP network, we initialized w as 1, s as log(9), α as 0.01, β as 0.1, γα as 0.99, γβ as 0.99, and ρ as 1. We set T as 20. ... we used weighted loss and set the positive sample weight as 300. The batch size was set to fully use the GPU memory, which was 20 for the Titan Xp card. ... we set the batch size as 8. However, we updated the model's parameters every 30 steps to stabilize the training process. We pre-train the score network for 100 epochs. As for the fine-tuning... We fine-tuned the whole model for 20 epochs. (A configuration sketch mirroring these settings appears below the table.)
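
For orientation, the following is a minimal PyTorch sketch of what an unrolled post-processing cell of this kind might look like, assuming a proximal-gradient-style update with the step sizes α and β, their decay rates γα and γβ, the penalty weight ρ, and T = 20 iterations named in the setup. The gradient and constraint terms are placeholders chosen for illustration; the exact update rules are those of Algorithms 1-2 in the paper and the released code.

```python
import torch
import torch.nn as nn


class UnrolledPostProcess(nn.Module):
    """Illustrative unrolled post-processing network (not the paper's exact updates)."""

    def __init__(self, T=20, alpha=0.01, beta=0.1,
                 gamma_alpha=0.99, gamma_beta=0.99, rho=1.0):
        super().__init__()
        self.T = T                                      # number of unrolled iterations
        self.alpha = nn.Parameter(torch.tensor(alpha))  # learnable primal step size
        self.beta = nn.Parameter(torch.tensor(beta))    # learnable dual step size
        self.gamma_alpha = gamma_alpha                  # per-iteration decay of alpha
        self.gamma_beta = gamma_beta                    # per-iteration decay of beta
        self.rho = rho                                  # penalty weight

    def forward(self, U, M):
        # U: pairwise scores from the deep score network; M: hard constraint mask.
        A = torch.sigmoid(U) * M                        # relaxed pairing matrix
        lam = torch.zeros_like(U)                       # dual variable for row-sum <= 1
        for t in range(self.T):
            grad = U - lam                              # placeholder objective gradient
            A = torch.clamp(A + self.alpha * self.gamma_alpha ** t * grad, 0.0, 1.0) * M
            violation = torch.relu(A.sum(dim=-1, keepdim=True) - 1.0)
            lam = lam + self.beta * self.gamma_beta ** t * self.rho * violation
        return A


# Toy usage: a length-32 sequence with an all-ones constraint mask.
L = 32
A_hat = UnrolledPostProcess()(torch.randn(L, L), torch.ones(L, L))
print(A_hat.shape)  # torch.Size([32, 32])
```

Because every iteration is differentiable, such a cell can be trained end to end together with the score network, which is the point of unrolling the algorithm.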
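
The stratified split itself can be illustrated with scikit-learn. The record list, RNA-type labels, and split fractions below are hypothetical stand-ins added for this sketch; the actual per-type counts are those reported in Table 7 of the paper.

```python
from sklearn.model_selection import train_test_split

# Dummy stand-ins: 100 sequence records labelled with 4 hypothetical RNA types.
records = [f"seq_{i}" for i in range(100)]
rna_types = [i % 4 for i in range(100)]

# First carve out a test set, then a validation set, stratifying on the RNA type
# so that every family appears in all three splits (fractions here are assumptions).
train_val, test, types_tv, _ = train_test_split(
    records, rna_types, test_size=0.2, stratify=rna_types, random_state=0)
train, val = train_test_split(
    train_val, test_size=0.1, stratify=types_tv, random_state=0)
```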
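
The stated hyper-parameters can be collected into a short configuration sketch. Only nhead=2, dim_feedforward=2048, dropout=0.1, the positive-sample weight of 300, the fine-tuning batch size of 8, and the 30-step update interval come from the quoted setup; the embedding width d_model, the number of encoder layers, the output head, the Adam optimizer, and the random data are assumptions added for illustration.

```python
import torch
import torch.nn as nn

# Assumed placeholders: d_model, num_layers, the linear head, Adam, and the random data.
d_model, seq_len, batch_size = 128, 64, 8

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=2, dim_feedforward=2048, dropout=0.1)  # stated settings
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)      # layer count assumed
head = nn.Linear(d_model, 1)                                      # per-position output (assumed)

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(300.0))  # positive-sample weight 300
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()))

accumulate_every = 30  # update parameters every 30 steps, as in the fine-tuning stage
for step in range(60):                                       # dummy training steps
    x = torch.randn(seq_len, batch_size, d_model)             # stand-in for sequence features
    y = (torch.rand(seq_len, batch_size, 1) < 0.05).float()   # sparse positive labels
    loss = criterion(head(encoder(x)), y) / accumulate_every  # scale for accumulation
    loss.backward()
    if (step + 1) % accumulate_every == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by the accumulation interval keeps the effective gradient magnitude comparable to a single large-batch update, which is one common way to stabilize training when the per-step batch is small.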