Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improved Off-policy Reinforcement Learning in Biological Sequence Design
Authors: Hyeonah Kim, Minsu Kim, Taeyoung Yun, Sanghyeok Choi, Emmanuel Bengio, Alex Hernández-García, Jinkyoo Park
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that δ-CS significantly improves GFlowNets, successfully discovering higher-score sequences compared to existing model-based optimization methods on diverse tasks, including DNA, RNA, protein, and peptide design. |
| Researcher Affiliation | Collaboration | 1 Mila – Quebec AI Institute, 2 Université de Montréal, 3 KAIST, 4 Valence Labs. Correspondence to: Hyeonah Kim <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Active Learning GFlowNets with δ-CS; Algorithm 2: Sampling with δ-CS |
| Open Source Code | Yes | Available at https://github.com/hyeonahkimm/delta_cs. |
| Open Datasets | Yes | We aim to generate DNA sequences (length L = 8) that maximize the binding affinity to the target transcription factor. Comprehensive analysis is allowed since the full sequence space is characterized by experiments (Barrera et al., 2016).... Task: The goal is to design an RNA sequence that binds to the target with the lowest binding energy, which is measured by Vienna RNA (Lorenz et al., 2011).... GFP. The objective is to identify protein sequences with high log-fluorescence intensity values (Sarkisyan et al., 2016).... AAV. The aim is to discover sequences that lead to higher gene therapeutic efficiency (Ogden et al., 2019). |
| Dataset Splits | Yes | The query batch size is all set as 128. For training proxy models...we use early stopping using the 10% of the dataset as a validation set and terminate the training procedure if validation loss does not improve for five consecutive iterations. For the DNA sequence design task, the initial dataset D0 is the bottom 50% in terms of the score, which results in 32,898 samples. For RNA, we have three RNA binding tasks...whose initial datasets consist of 5,000 randomly generated sequences. For GFP...we obtain the initial dataset with |D0| = 10,200... For AAV...we collect an initial dataset of 15,307 sequences. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | For training proxy models, we follow the procedure of (Jain et al., 2022). We use Adam (Kingma, 2015) optimizer...The full sequence s_L = x is obtained after L steps, where L is the sequence length. The forward policy P_F(τ; θ) is a compositional policy defined as...The policy is trained to minimize the TB loss as follows: L_TB(τ; θ) = log Z_θ P_F(τ; θ)...As described in Section 6, we employ a two-layer long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997)...For the Gaussian Process Regressor (GPR), we use the default setting from the sklearn library. The paper mentions software components and optimizers but does not provide specific version numbers for Python, libraries such as PyTorch or scikit-learn, or CUDA. |
| Experiment Setup | Yes | For training proxy models, we follow the procedure of (Jain et al., 2022). We use Adam (Kingma, 2015) optimizer with learning rate 1 × 10^-5 and batch size of 256. The maximum proxy update is set as 3000. To prevent over-fitting, we use early stopping using the 10% of the dataset as a validation set and terminate the training procedure if validation loss does not improve for five consecutive iterations. As described in Section 6, we employ a two-layer long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) with 512 hidden dimensions. The policy is trained with a learning rate of 5 × 10^-4 with a batch size of 256. The learning rate of Z is set as 10^-3. The coefficient κ in Section 3.2 is set as 0.1 for TF-Bind-8 and AMP with MC dropout, according to Jain et al. (2022), and 1.0 for RNA and protein design with Ensemble following Ren et al. (2022). We use a UCB acquisition function and measure the uncertainty with an ensemble of three network instances. We use δ_const = 0.5 for DNA (L = 8) and RNA (L = 14) sequence design and δ = 0.05 for protein design (L = 238, 90). Lastly, we set λ to satisfy λ E_{D0}[σ(x)] ≈ 1/L based on the observations from the initial round. |
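The proxy-training protocol quoted above (10% of the data held out for validation, training terminated after five consecutive non-improving iterations) can be sketched as a patience rule. This is an illustrative assumption of how such a stopping criterion is commonly implemented, not the authors' code; the function name `should_stop` is hypothetical.

```python
def should_stop(val_losses, patience=5):
    """Illustrative early-stopping rule (assumed, not the authors' exact code):
    stop once the best validation loss has not improved for `patience`
    consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    # Best loss achieved before the most recent `patience` evaluations.
    best_before = min(val_losses[:-patience])
    # Stop if none of the recent evaluations beat that best.
    return all(v >= best_before for v in val_losses[-patience:])
```

With the quoted patience of five, training continues as long as any of the last five validation losses improves on the earlier best.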
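The truncated trajectory-balance (TB) loss quoted in the Software Dependencies row, L_TB(τ; θ) = log Z_θ P_F(τ; θ)..., has the standard form (log Z_θ + log P_F(τ; θ) − log R(x))² when the backward policy is deterministic, as it is for left-to-right sequence generation where each state has a single parent. The sketch below is a minimal numerical illustration under that assumption; it abstracts the paper's two-layer LSTM policy into an array of per-step log-probabilities and is not the authors' implementation.

```python
import numpy as np

def tb_loss(log_z, step_log_probs, log_reward):
    """Trajectory-balance loss, assuming a deterministic backward policy:
    L_TB(tau) = (log Z + sum_t log P_F(a_t | s_t) - log R(x))^2.

    step_log_probs: shape (batch, seq_len), log-prob of each chosen token.
    log_reward:     shape (batch,), log R(x) for each completed sequence.
    """
    log_pf = step_log_probs.sum(axis=1)  # log P_F(tau) per trajectory
    return np.mean((log_z + log_pf - log_reward) ** 2)
```

At the TB optimum, log Z_θ matches the log-partition function of the reward, so a batch where log Z + log P_F(τ) equals log R(x) exactly yields zero loss.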
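The Experiment Setup row describes a UCB acquisition function whose uncertainty comes from an ensemble of three proxy networks, weighted by the coefficient κ. A common instantiation, sketched here as an assumption rather than the authors' code, scores each candidate by the ensemble mean plus κ times the across-model standard deviation.

```python
import numpy as np

def ucb(ensemble_preds, kappa=1.0):
    """UCB acquisition sketch (assumed form, not the authors' code).

    ensemble_preds: shape (n_models, n_candidates), one proxy prediction per
    model, e.g. the 3-network ensemble described above. kappa trades off
    exploitation (mean) against exploration (ensemble disagreement).
    """
    mu = ensemble_preds.mean(axis=0)
    sigma = ensemble_preds.std(axis=0)  # disagreement as uncertainty proxy
    return mu + kappa * sigma
```

Candidates where the proxies disagree receive a bonus proportional to κ, which matches the quoted choice of a smaller κ (0.1) for MC-dropout tasks and a larger κ (1.0) for the ensemble-based RNA and protein tasks.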