Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improved Off-policy Reinforcement Learning in Biological Sequence Design
Authors: Hyeonah Kim, Minsu Kim, Taeyoung Yun, Sanghyeok Choi, Emmanuel Bengio, Alex Hernández-García, Jinkyoo Park
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that δ-CS significantly improves GFlowNets, successfully discovering higher-score sequences compared to existing model-based optimization methods on diverse tasks, including DNA, RNA, protein, and peptide design. |
| Researcher Affiliation | Collaboration | 1 Mila – Quebec AI Institute, 2 Université de Montréal, 3 KAIST, 4 Valence Labs. Correspondence to: Hyeonah Kim <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Active Learning GFlowNets with δ-CS; Algorithm 2: Sampling with δ-CS |
| Open Source Code | Yes | Available at https://github.com/hyeonahkimm/delta_cs. |
| Open Datasets | Yes | We aim to generate DNA sequences (length L = 8) that maximize the binding affinity to the target transcription factor. Comprehensive analysis is allowed since the full sequence space is characterized by experiments (Barrera et al., 2016).... Task: The goal is to design an RNA sequence that binds to the target with the lowest binding energy, which is measured by Vienna RNA (Lorenz et al., 2011).... GFP. The objective is to identify protein sequences with high log-fluorescence intensity values (Sarkisyan et al., 2016).... AAV. The aim is to discover sequences that lead to higher gene therapeutic efficiency (Ogden et al., 2019). |
| Dataset Splits | Yes | The query batch size is all set as 128. For training proxy models...we use early stopping using the 10% of the dataset as a validation set and terminate the training procedure if validation loss does not improve for five consecutive iterations. For the DNA sequence design task, the initial dataset D0 is the bottom 50% in terms of the score, which results in 32,898 samples. For RNA, we have three RNA binding tasks...whose initial datasets consist of 5,000 randomly generated sequences. For GFP...we obtain the initial dataset with |D0| = 10,200... For AAV...we collect an initial dataset of 15,307 sequences. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | For training proxy models, we follow the procedure of (Jain et al., 2022). We use Adam (Kingma, 2015) optimizer...The full sequence s_L = x is obtained after L steps, where L is the sequence length. The forward policy P_F(τ; θ) is a compositional policy defined as...The policy is trained to minimize the TB loss as follows: L_TB(τ; θ) = log Z_θ P_F(τ; θ)...As described in Section 6, we employ a two-layer long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997)...For the Gaussian Process Regressor (GPR), we use the default setting from the sklearn library. The paper mentions software components and optimizers but does not provide specific version numbers for Python, libraries such as PyTorch or scikit-learn, or CUDA. |
| Experiment Setup | Yes | For training proxy models, we follow the procedure of (Jain et al., 2022). We use Adam (Kingma, 2015) optimizer with learning rate 1 × 10^-5 and batch size of 256. The maximum proxy update is set as 3000. To prevent over-fitting, we use early stopping using the 10% of the dataset as a validation set and terminate the training procedure if validation loss does not improve for five consecutive iterations. As described in Section 6, we employ a two-layer long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) with 512 hidden dimensions. The policy is trained with a learning rate of 5 × 10^-4 with a batch size of 256. The learning rate of Z is set as 10^-3. The coefficient κ in Section 3.2 is set as 0.1 for TF-Bind-8 and AMP with MC dropout, according to Jain et al. (2022), and 1.0 for RNA and protein design with Ensemble following Ren et al. (2022). We use a UCB acquisition function and measure the uncertainty with an ensemble of three network instances. We use δ_const = 0.5 for DNA (L = 8) and RNA (L = 14) sequence design and δ = 0.05 for protein design (L = 238, 90). Lastly, we set λ to satisfy λ E_{D0}[σ(x)] ≈ 1/L based on the observations from the initial round. |
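The proxy-training protocol quoted above (10% of the data held out for validation, training terminated after five consecutive non-improving iterations) can be sketched as a patience rule. This is an illustrative assumption of how such a stopping criterion is commonly implemented, not the authors' code; the function name `should_stop` is hypothetical.

```python
def should_stop(val_losses, patience=5):
    """Illustrative early-stopping rule (assumed, not the authors' exact code):
    stop once the best validation loss has not improved for `patience`
    consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    # Best loss achieved before the most recent `patience` evaluations.
    best_before = min(val_losses[:-patience])
    # Stop if none of the recent evaluations beat that best.
    return all(v >= best_before for v in val_losses[-patience:])
```

With the quoted patience of five, training continues as long as any of the last five validation losses improves on the earlier best.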
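The truncated trajectory-balance (TB) loss quoted in the Software Dependencies row, L_TB(τ; θ) = log Z_θ P_F(τ; θ)..., has the standard form (log Z_θ + log P_F(τ; θ) − log R(x))² when the backward policy is deterministic, as it is for left-to-right sequence generation where each state has a single parent. The sketch below is a minimal numerical illustration under that assumption; it abstracts the paper's two-layer LSTM policy into an array of per-step log-probabilities and is not the authors' implementation.

```python
import numpy as np

def tb_loss(log_z, step_log_probs, log_reward):
    """Trajectory-balance loss, assuming a deterministic backward policy:
    L_TB(tau) = (log Z + sum_t log P_F(a_t | s_t) - log R(x))^2.

    step_log_probs: shape (batch, seq_len), log-prob of each chosen token.
    log_reward:     shape (batch,), log R(x) for each completed sequence.
    """
    log_pf = step_log_probs.sum(axis=1)  # log P_F(tau) per trajectory
    return np.mean((log_z + log_pf - log_reward) ** 2)
```

At the TB optimum, log Z_θ matches the log-partition function of the reward, so a batch where log Z + log P_F(τ) equals log R(x) exactly yields zero loss.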
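The Experiment Setup row describes a UCB acquisition function whose uncertainty comes from an ensemble of three proxy networks, weighted by the coefficient κ. A common instantiation, sketched here as an assumption rather than the authors' code, scores each candidate by the ensemble mean plus κ times the across-model standard deviation.

```python
import numpy as np

def ucb(ensemble_preds, kappa=1.0):
    """UCB acquisition sketch (assumed form, not the authors' code).

    ensemble_preds: shape (n_models, n_candidates), one proxy prediction per
    model, e.g. the 3-network ensemble described above. kappa trades off
    exploitation (mean) against exploration (ensemble disagreement).
    """
    mu = ensemble_preds.mean(axis=0)
    sigma = ensemble_preds.std(axis=0)  # disagreement as uncertainty proxy
    return mu + kappa * sigma
```

Candidates where the proxies disagree receive a bonus proportional to κ, which matches the quoted choice of a smaller κ (0.1) for MC-dropout tasks and a larger κ (1.0) for the ensemble-based RNA and protein tasks.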