Biological Sequence Design with GFlowNets

Authors: Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, Yoshua Bengio

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches with high scoring candidates compared to existing approaches.
Researcher Affiliation Collaboration 1Mila, 2Université de Montréal, 3McGill University, 4Jacobs University Bremen, 5New York University, 6IBM, 7CIFAR Fellow and AI Chair.
Pseudocode Yes Algorithm 1: Multi-Round Active Learning; Algorithm 2: GFlowNet Inner Loop (with training data).
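For orientation, a minimal sketch of how the multi-round loop of Algorithm 1 fits together, with the GFlowNet inner loop of Algorithm 2 abstracted behind a callable. The names `train_proxy`, `train_generator`, `oracle`, and the `.score()`/`.sample()` interfaces are placeholders, not the authors' code:

```python
def active_learning_loop(d1, oracle, train_proxy, train_generator,
                         num_rounds=10, K=128, t=5):
    """Sketch of the multi-round active learning loop (Algorithm 1).

    d1              : list of (sequence, score) pairs available to the algorithm
    oracle          : callable scoring a sequence (stands in for the wet lab)
    train_proxy     : callable(dataset) -> proxy object with .score(seq)
    train_generator : callable(proxy, dataset) -> generator with .sample()
    """
    dataset = list(d1)
    for _ in range(num_rounds):
        proxy = train_proxy(dataset)                 # fit the proxy on data gathered so far
        generator = train_generator(proxy, dataset)  # GFlowNet inner loop (Algorithm 2)

        # Sample t*K candidates and keep the top K under the proxy score.
        candidates = [generator.sample() for _ in range(t * K)]
        candidates.sort(key=proxy.score, reverse=True)
        batch = candidates[:K]

        # Label the chosen batch with the oracle and grow the dataset.
        dataset.extend((x, oracle(x)) for x in batch)
    return dataset
```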
Open Source Code Yes The code is available at https://github.com/MJ10/BioSeq-GFN-AL.
Open Datasets Yes Anti-Microbial Peptide Design: ...from the DBAASP database (Pirtskhalava et al., 2021). TF Bind 8: The data and oracle are from (Barrera et al., 2016b). GFP: The data and oracle are from (Sarkisyan et al., 2016; Rao et al., 2019a).
Dataset Splits Yes We split the above-mentioned dataset into two parts: D1 and D2. D1 is available for use by the algorithms, whereas D2 is used to train the oracle f, following (Angermueller et al., 2019), as a simulation of wet-lab experiments for the generated sequences. We use early stopping, keeping 10% of the data as a validation set.
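A small helper illustrating this split might look like the sketch below; the D1/D2 proportion and the random seed are assumptions, since only the 10% validation fraction is stated in the quote above:

```python
import random

def split_dataset(data, d1_fraction=0.5, val_fraction=0.1, seed=0):
    """Illustrative split: D1 for the algorithm, D2 for oracle training,
    with 10% of D1 held out as a validation set for early stopping."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    n_d1 = int(d1_fraction * len(data))
    d1, d2 = data[:n_d1], data[n_d1:]
    n_val = int(val_fraction * len(d1))
    d1_val, d1_train = d1[:n_val], d1[n_val:]
    return d1_train, d1_val, d2
```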
Hardware Specification No The paper only generally states 'compute resources provided by Compute Canada' without providing specific hardware details such as GPU/CPU models or memory configurations.
Software Dependencies No The paper mentions implementing the algorithm in 'PyTorch (Paszke et al., 2019)' but does not provide a specific version number for PyTorch or list any other software dependencies with their versions.
Experiment Setup Yes Proxy: We use an MLP with 2 hidden layers of dimension 2048 and ReLU activation... 25 samples with dropout rate 0.1, and weight decay of 0.0001. We use a minibatch size of 256 for training with an MSE loss, using the Adam optimizer (Kingma & Ba, 2017), with learning rate 10^-4 and (β0, β1) = (0.9, 0.999). We use early stopping, keeping 10% of the data as a validation set. For UCB (µ + κσ) we use κ = 0.1. GFlowNet Generator: We parameterize the flow as an MLP with 2 hidden layers of dimension 2048... We use the trajectory balance objective... Adam optimizer with (β0, β1) = (0.9, 0.999). Table 7 shows the rest of the hyperparameters... γ, the proportion of offline trajectories, is set to 0.5... The learning rate for log Z is set to 10^-3... In each round we sample t·K candidates and pick the top K based on the proxy score, where t is set to 5 for all experiments. Table 7 lists the specific hyperparameters for the GFlowNet Generator.
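Putting the quoted proxy settings into code, a sketch might look like the following. The input encoding, sequence length, and the reading of "25 samples with dropout rate 0.1" as MC-dropout uncertainty estimation are assumptions; only the layer sizes, dropout, optimizer settings, and κ are taken from the quote:

```python
import torch
import torch.nn as nn

class ProxyMLP(nn.Module):
    """Proxy model: MLP with 2 hidden layers of dimension 2048 and ReLU."""
    def __init__(self, input_dim, hidden_dim=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ucb_score(model, x, n_samples=25, kappa=0.1):
    """UCB acquisition mu + kappa*sigma, uncertainty from MC dropout (assumption)."""
    model.train()  # keep dropout active at prediction time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0) + kappa * preds.std(0)

# Optimizer settings as quoted: Adam, lr 1e-4, betas (0.9, 0.999), weight decay 1e-4, MSE loss.
proxy = ProxyMLP(input_dim=8 * 4)  # e.g. one-hot TF Bind 8 sequences (assumption)
opt = torch.optim.Adam(proxy.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)
loss_fn = nn.MSELoss()
```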