Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Biological Sequence Design with GFlowNets
Authors: Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, Yoshua Bengio
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches with high scoring candidates compared to existing approaches. |
| Researcher Affiliation | Collaboration | 1Mila 2Université de Montréal 3McGill University 4Jacobs University Bremen 5New York University 6IBM 7CIFAR Fellow and AI Chair. |
| Pseudocode | Yes | Algorithm 1: Multi-Round Active Learning; Algorithm 2: GFlowNet Inner Loop (with training data) |
| Open Source Code | Yes | The code is available at https://github.com/MJ10/BioSeq-GFN-AL. |
| Open Datasets | Yes | Anti-Microbial Peptide Design: ...from the DBAASP database (Pirtskhalava et al., 2021). TF Bind 8: The data and oracle are from (Barrera et al., 2016b). GFP: The data and oracle are from (Sarkisyan et al., 2016; Rao et al., 2019a). |
| Dataset Splits | Yes | We split the above mentioned dataset into two parts: D1 and D2. D1 is available for use [by] the algorithms, whereas D2 is used to train the oracle, f, following (Angermueller et al., 2019), as a simulation of wet-lab experiments for the generated sequences. We use early stopping, keeping 10% of the data as a validation set. |
| Hardware Specification | No | The paper only generally states 'compute resources provided by Compute Canada' without providing specific hardware details such as GPU/CPU models or memory configurations. |
| Software Dependencies | No | The paper mentions implementing the algorithm in 'PyTorch (Paszke et al., 2019)' but does not provide a specific version number for PyTorch or list any other software dependencies with their versions. |
| Experiment Setup | Yes | Proxy: We use MLP with 2 hidden layers of dimension 2048 and ReLU activation... 25 samples with dropout rate 0.1, and weight decay of 0.0001. We use a minibatch size of 256 for training with a MSE loss, using the Adam optimizer (Kingma & Ba, 2017), with learning rate 10⁻⁴ and (β0, β1) = (0.9, 0.999). We use early stopping, keeping 10% of the data as a validation set. For UCB (µ + κσ) we use κ = 0.1. GFlowNet Generator: We parameterize the flow as a MLP with 2 hidden layers of dimension 2048... We use the trajectory balance objective... Adam optimizer with (β0, β1) = (0.9, 0.999). Table 7 shows the rest of the hyperparameters... γ, the proportion of offline trajectories to 0.5... Learning rate for log Z is set to 10⁻³... In each round we sample t × K candidates, and pick the top K based on the proxy score, where t is set to 5 for all experiments. Table 7 lists specific hyperparameters for the GFlowNet Generator. |
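
The candidate selection step quoted in the Experiment Setup row (sample t × K candidates, score each with the UCB rule µ + κσ, keep the top K) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ucb_score` and `select_batch`, the callables `sample_candidate` and `predict_samples`, and the use of the population standard deviation for σ are all assumptions made for illustration. Only κ = 0.1 and t = 5 come from the quoted text.

```python
import statistics

KAPPA = 0.1  # UCB coefficient kappa, as quoted in the Experiment Setup row
T = 5        # oversampling factor t, as quoted in the Experiment Setup row


def ucb_score(predictions, kappa=KAPPA):
    """Return mu + kappa * sigma over several stochastic proxy predictions.

    Assumption: sigma is the population standard deviation of the
    predictions; the quoted text does not say which estimator is used.
    """
    mu = statistics.fmean(predictions)
    sigma = statistics.pstdev(predictions)
    return mu + kappa * sigma


def select_batch(sample_candidate, predict_samples, k, t=T, kappa=KAPPA):
    """Draw t * k candidates and keep the k with the highest UCB score.

    sample_candidate: hypothetical callable returning one new candidate.
    predict_samples:  hypothetical callable mapping a candidate to a list
                      of proxy predictions (for example, 25 dropout passes).
    """
    candidates = [sample_candidate() for _ in range(t * k)]
    scored = [(ucb_score(predict_samples(c), kappa), c) for c in candidates]
    # Sort on the score only, so candidates never need to be comparable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]
```

In a real pipeline, `sample_candidate` would presumably wrap the trained generator and `predict_samples` would wrap repeated stochastic forward passes of the proxy; here both are left as plain callables so the sketch stays independent of any particular framework.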