Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

BAnG: Bidirectional Anchored Generation for Conditional RNA Design

Authors: Roman Klypa, Alberto Bietti, Sergei Grudinin

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein. We thoroughly validate the effectiveness of our method on relevant synthetic tasks and compare it with other widely used sequence generation methods.
Researcher Affiliation | Academia | 1 Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. 2 Center for Computational Mathematics, Flatiron Institute, 162 5th Ave, New York, NY 10010, USA. Correspondence to: Roman Klypa <EMAIL>, Alberto Bietti <EMAIL>, Sergei Grudinin <EMAIL>.
Pseudocode | Yes | Algorithm A.1 (Geometric Attention); input: s_i, T_i
Open Source Code | Yes | Software and Data. The code and the model, along with the model weights, are available at https://github.com/rsklypa/RNA-BAnG. The code allows running the RNA sequence generation process for any valid protein 3D structure as input.
Open Datasets | Yes | We collected our protein–nucleotide interaction data from the Protein Data Bank (PDB) (Berman, 2000), utilizing information provided in the PPI3D database (Dapkūnas et al., 2024). Additionally, to diversify RNA sequence information, we collected non-coding sequences from RNAcentral (release 24), a comprehensive database integrating RNA sequences from multiple expert sources (The RNAcentral Consortium et al., 2019). We built our test set using data from RNAcompete experiments (Ray et al., 2009), conducted by the authors of the RNA Compendium (Ray et al., 2013).
Dataset Splits | Yes | During training, we split the data by protein sequence homology, measured using the clustering provided by PPI3D (Dapkūnas et al., 2024), where proteins within the same cluster share less than 40% sequence similarity with those in other clusters. We allocated samples from 95% of randomly selected clusters to the training set and used the rest for validation.
Hardware Specification | Yes | Complete training took 4 days on a single MI120 AMD GPU.
Software Dependencies | No | The paper mentions various software tools and algorithms such as VoroContacts, CD-HIT-EST, MAFFT, MOODS, blastn, the Adam optimizer, and AMSGrad, but it does not provide specific version numbers for any of these software packages or programming languages.
Experiment Setup | Yes | The latent dimension of the model is c_s = 128; all head dimensions are set to c_h = 64. Feedforward blocks have a scaling factor n = 2. We set the number of protein and nucleotide modules to 10 each. The number of attention heads is h = 12, the number of query points is N_query_points = 4, and the number of value points is N_value_points = 8. We trained the model for each method for 80k steps with a batch size of 8, using the Adam optimizer (Kingma & Ba, 2017) with a learning rate of 0.0001. The learning rate was warmed up linearly for the first 1k steps and then decayed exponentially with γ = 0.99 and a period of 1k steps.
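The cluster-level split described in the Dataset Splits row (whole homology clusters assigned to either train or validation, so no cluster spans both) could be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_cluster`, the `cluster_of` mapping, and the fixed seed are all assumptions for the example.

```python
import random

def split_by_cluster(samples, cluster_of, train_frac=0.95, seed=0):
    """Assign whole clusters to train or validation (hypothetical sketch).

    samples: iterable of sample ids.
    cluster_of: dict mapping sample id -> cluster id (e.g. from PPI3D).
    """
    # Collect the distinct clusters and shuffle them reproducibly.
    clusters = sorted({cluster_of[s] for s in samples})
    random.Random(seed).shuffle(clusters)

    # Allocate train_frac of the *clusters* (not samples) to training.
    n_train = int(len(clusters) * train_frac)
    train_clusters = set(clusters[:n_train])

    train = [s for s in samples if cluster_of[s] in train_clusters]
    val = [s for s in samples if cluster_of[s] not in train_clusters]
    return train, val
```

Splitting by cluster rather than by individual sample is what prevents homologous proteins (over 40% sequence similarity) from leaking between the training and validation sets.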
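The learning-rate schedule in the Experiment Setup row (linear warmup over the first 1k steps, then exponential decay with γ = 0.99 applied every 1k steps) could look like the sketch below. The exact anchoring of the decay periods (here, counted from the end of warmup) is an assumption; the paper does not spell it out.

```python
def learning_rate(step, base_lr=1e-4, warmup_steps=1000, gamma=0.99, period=1000):
    """Hypothetical reconstruction of the schedule described in the paper."""
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr over the first warmup_steps.
        return base_lr * (step + 1) / warmup_steps
    # After warmup, multiply by gamma once per completed period (assumed anchoring).
    decay_periods = (step - warmup_steps) // period
    return base_lr * gamma ** decay_periods
```

A staircase decay like this (rather than a per-step multiplier) matches the stated "period of 1k steps"; frameworks such as PyTorch express the same idea via a step-wise scheduler.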