Biological Sequence Design with GFlowNets
Authors: Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, Yoshua Bengio
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches with high scoring candidates compared to existing approaches. |
| Researcher Affiliation | Collaboration | 1Mila, 2Université de Montréal, 3McGill University, 4Jacobs University Bremen, 5New York University, 6IBM, 7CIFAR Fellow and AI Chair. |
| Pseudocode | Yes | Algorithm 1: Multi-Round Active Learning; Algorithm 2: GFlowNet Inner Loop (with training data). (A sketch of this loop appears below the table.) |
| Open Source Code | Yes | The code is available at https://github.com/MJ10/BioSeq-GFN-AL. |
| Open Datasets | Yes | Anti-Microbial Peptide Design: ...from the DBAASP database (Pirtskhalava et al., 2021). TF Bind 8: The data and oracle are from (Barrera et al., 2016b). GFP: The data and oracle are from (Sarkisyan et al., 2016; Rao et al., 2019a). |
| Dataset Splits | Yes | We split the above-mentioned dataset into two parts: D1 and D2. D1 is available for use by the algorithms, whereas D2 is used to train the oracle f, following (Angermueller et al., 2019), as a simulation of wet-lab experiments for the generated sequences. We use early stopping, keeping 10% of the data as a validation set. (A split sketch appears below the table.) |
| Hardware Specification | No | The paper only generally states 'compute resources provided by Compute Canada' without providing specific hardware details such as GPU/CPU models or memory configurations. |
| Software Dependencies | No | The paper mentions implementing the algorithm in 'PyTorch (Paszke et al., 2019)' but does not provide a specific version number for PyTorch or list any other software dependencies with their versions. |
| Experiment Setup | Yes | Proxy: We use an MLP with 2 hidden layers of dimension 2048 and ReLU activation... 25 samples with dropout rate 0.1, and weight decay of 0.0001. We use a minibatch size of 256 for training with an MSE loss, using the Adam optimizer (Kingma & Ba, 2017), with learning rate 10⁻⁴ and (β0, β1) = (0.9, 0.999). We use early stopping, keeping 10% of the data as a validation set. For UCB (µ + κσ) we use κ = 0.1. GFlowNet Generator: We parameterize the flow as an MLP with 2 hidden layers of dimension 2048... We use the trajectory balance objective... Adam optimizer with (β0, β1) = (0.9, 0.999). Table 7 shows the rest of the hyperparameters... γ, the proportion of offline trajectories, is set to 0.5... The learning rate for log Z is set to 10⁻³... In each round we sample t·K candidates and pick the top K based on the proxy score, where t is set to 5 for all experiments. Table 7 lists specific hyperparameters for the GFlowNet Generator. (Sketches of the proxy with UCB acquisition and of the trajectory balance objective appear below the table.) |
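
The multi-round procedure referenced in the Pseudocode row (Algorithm 1, with the GFlowNet training of Algorithm 2 as its inner loop) can be summarized as follows. This is a minimal sketch, not the authors' implementation (which is in the linked repository): `train_proxy`, `train_gflownet`, `oracle`, `generator.sample`, and `proxy.score` are hypothetical callables standing in for the components described in the paper.

```python
def active_learning_loop(D1, oracle, train_proxy, train_gflownet,
                         num_rounds, K, t=5, kappa=0.1):
    """Sketch of the multi-round active-learning loop (Algorithm 1).

    Hypothetical callables (assumptions, not the paper's API):
      oracle(x) -> score, train_proxy(data) -> proxy,
      train_gflownet(proxy, data, kappa) -> generator.
    """
    dataset = list(D1)  # offline dataset D1 available to the algorithm
    for _ in range(num_rounds):
        # Fit the MC-dropout proxy on all data gathered so far.
        proxy = train_proxy(dataset)
        # Inner loop (Algorithm 2): train the GFlowNet so that sequences are
        # sampled with probability proportional to the reward mu + kappa * sigma.
        generator = train_gflownet(proxy, dataset, kappa=kappa)
        # Sample t*K candidates and keep the top K under the proxy score.
        candidates = generator.sample(t * K)
        top_k = sorted(candidates, key=proxy.score, reverse=True)[:K]
        # Query the oracle (the simulated wet-lab) and grow the dataset.
        dataset += [(x, oracle(x)) for x in top_k]
    return dataset
```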
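
The D1/D2 protocol from the Dataset Splits row amounts to partitioning the labeled data once, before any active-learning round. A minimal sketch, assuming (sequence, score) pairs; the 50/50 fraction is illustrative only (the quoted passage does not state it), and `split_dataset` is a hypothetical helper name.

```python
import random

def split_dataset(data, d1_fraction=0.5, seed=0):
    """Split (sequence, score) pairs into D1 (given to the algorithm) and D2
    (used only to train the oracle f). The fraction is an assumption."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(d1_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]  # D1, D2
```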
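
The proxy described in the Experiment Setup row is a 2-hidden-layer MLP (width 2048, ReLU, dropout 0.1) whose uncertainty is estimated from 25 MC-dropout samples and combined into a UCB score µ + κσ with κ = 0.1. The sketch below assumes PyTorch and a flattened one-hot input featurization (illustrated for TF Bind 8); the class and function names are placeholders, not the authors' API.

```python
import torch
import torch.nn as nn

class ProxyMLP(nn.Module):
    """2-hidden-layer MLP proxy with dropout, as described in the setup row."""
    def __init__(self, in_dim, hidden=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ucb_score(model, x, n_samples=25, kappa=0.1):
    """UCB acquisition mu + kappa * sigma from MC-dropout samples."""
    model.train()  # keep dropout active at inference time for MC sampling
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0) + kappa * preds.std(0)

# Training follows the quoted settings: MSE loss, Adam(lr=1e-4, betas=(0.9, 0.999)),
# weight decay 1e-4, minibatch size 256, early stopping on a 10% validation split.
model = ProxyMLP(in_dim=8 * 4)  # e.g. TF Bind 8: length-8 DNA, 4-letter alphabet (assumed featurization)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
loss_fn = nn.MSELoss()
```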
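
The GFlowNet generator is trained with the trajectory balance objective. For left-to-right sequence construction each state has a single parent, so the backward-policy term is 1 and the objective reduces to a squared difference between log Z plus the summed forward log-probabilities and the log-reward. A minimal sketch under that autoregressive assumption; the helper name and the numbers in the usage lines are illustrative, not from the paper.

```python
import torch

def trajectory_balance_loss(log_z, log_pf_actions, log_reward):
    """Trajectory balance for autoregressive sequence generation:
        L(tau) = (log Z + sum_t log P_F(a_t | s_t) - log R(x))^2
    where the backward policy contributes nothing because each state has one parent.
    """
    return (log_z + log_pf_actions.sum() - log_reward) ** 2

# Illustrative usage. log Z is a learned scalar, trained with its own learning
# rate (1e-3 per the setup row); the probabilities and reward here are placeholders.
log_z = torch.nn.Parameter(torch.zeros(()))
log_pf = torch.log(torch.tensor([0.3, 0.5, 0.2, 0.4]))  # per-step log P_F from the policy
loss = trajectory_balance_loss(log_z, log_pf, log_reward=torch.tensor(-1.2))
loss.backward()  # gradients flow to log_z (and, in practice, to the policy network)
```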