Dirichlet Flow Matching with Applications to DNA Sequence Design
Authors: Hannes Stärk, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, Tommi Jaakkola
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. [...] 4. Experiments |
| Researcher Affiliation | Academia | 1CSAIL, Massachusetts Institute of Technology; 2Dept. of Mathematics, Massachusetts Institute of Technology. |
| Pseudocode | Yes | Algorithm 1 TRAINING. [...] Algorithm 2 INFERENCE. (A training-step sketch appears below the table.) |
| Open Source Code | Yes | Code is available at https://github.com/HannesStark/dirichlet-flow-matching. |
| Open Datasets | Yes | We use a dataset of 100,000 promoter sequences with 1,024 base pairs extracted from a database of human promoters (Hon et al., 2017). [...] We evaluate on two enhancer sequence datasets from fly brain cells (Janssens et al., 2022) and from human melanoma cells (Atak et al., 2021). |
| Dataset Splits | Yes | We train for 200 epochs with a learning rate of 5 × 10^-4 and early stopping on the MSE on the validation set. [...] For the enhancer data of 104665 fly brain cell sequences (Janssens et al., 2022), we use the same split as Taskiran et al. (2023), resulting in an 83726/10505/10434 split for train/val/test. Meanwhile, for the human melanoma cell dataset of 88870 sequences (Atak et al., 2021), their split has 70892/8966/9012 sequences. |
| Hardware Specification | Yes | Computational requirements. We train on RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions using an architecture similar to DDSM (Avdeyev et al., 2023) and replacing certain normalization layers, but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow, specific libraries). |
| Experiment Setup | Yes | Toy experiments. We train all models in Figure 4 for 450,000 steps with a batch size of 512... Promoter Design. We follow the setup of Avdeyev et al. (2023) and train for 200 epochs with a learning rate of 5 × 10^-4 and early stopping on the MSE on the validation set. [...] Enhancer Design. For both evaluations... we train for 800 epochs... For inference, we use 100 integration steps. [...] Classifier-free Guidance. ...train with a conditioning ratio... of 0.7. [...] Architecture. The architecture that we use for the promoter design experiments is the same as in DDSM (Avdeyev et al., 2023). [...] The model consists of 20 layers of 1D convolutions interleaved with time embedding layers... and normalization layers. (An architecture sketch appears below the table.) |
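To make the Pseudocode row concrete, the snippet below is a minimal PyTorch sketch of a Dirichlet flow matching training step in the spirit of Algorithm 1: the clean sequence x1 is noised along the conditional path x_t ~ Dir(1 + t · one_hot(x1)) and the denoiser is trained with cross-entropy to recover x1. The `model(xt, t)` signature, the batch shapes, and the uniform sampling of t over [0, 8) are illustrative assumptions rather than the authors' exact choices; the inference-time vector field of Algorithm 2 is not reproduced here.

```python
# Minimal sketch of a Dirichlet flow matching training step (in the spirit of Algorithm 1).
# Model signature, time distribution, and shapes are assumptions; see the released code
# at https://github.com/HannesStark/dirichlet-flow-matching for the authors' exact setup.
import torch
import torch.nn.functional as F

K = 4      # alphabet size (A, C, G, T)
L = 1024   # sequence length (the promoter experiments use 1,024 bp)

def dfm_training_step(model, x1, optimizer):
    """One training step. x1: LongTensor of shape (batch, L) with values in [0, K)."""
    b = x1.shape[0]
    # Sample a time per sequence; the uniform range [0, 8) is an illustrative assumption.
    t = torch.rand(b, device=x1.device) * 8.0
    # Conditional probability path: x_t ~ Dir(1 + t * one_hot(x1)).
    alpha = torch.ones(b, L, K, device=x1.device)
    alpha = alpha + t[:, None, None] * F.one_hot(x1, K).float()
    xt = torch.distributions.Dirichlet(alpha).sample()   # simplex points, shape (b, L, K)
    # The denoiser predicts a distribution over the clean token x1 at every position.
    logits = model(xt, t)                                 # (b, L, K)
    loss = F.cross_entropy(logits.reshape(-1, K), x1.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```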
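The Experiment Setup row describes the promoter-design denoiser as 20 layers of 1D convolutions interleaved with time-embedding and normalization layers. The sketch below shows one way such a stack could look; the hidden width, kernel size, activation, and conditioning mechanism are assumptions, not the DDSM-derived architecture actually used in the paper.

```python
# Minimal sketch of a 20-layer 1D-convolutional denoiser with time conditioning and
# normalization, matching the description in the Experiment Setup row. Widths, kernel
# size, and the conditioning mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(dim)
        self.time_proj = nn.Linear(dim, dim)

    def forward(self, h, t_emb):
        # h: (batch, L, dim), t_emb: (batch, dim)
        h = h + self.time_proj(t_emb)[:, None, :]           # add time conditioning
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)    # 1D convolution over the sequence
        return torch.relu(self.norm(h))

class DirichletFlowCNN(nn.Module):
    def __init__(self, alphabet=4, dim=256, n_layers=20):
        super().__init__()
        self.embed = nn.Linear(alphabet, dim)               # simplex point -> hidden features
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(ConvBlock(dim) for _ in range(n_layers))
        self.out = nn.Linear(dim, alphabet)                 # logits over the clean tokens

    def forward(self, xt, t):
        # xt: (batch, L, alphabet) simplex-valued input, t: (batch,)
        t_emb = self.time_mlp(t[:, None])
        h = self.embed(xt)
        for block in self.blocks:
            h = block(h, t_emb)
        return self.out(h)                                  # (batch, L, alphabet)
```

A model built this way plugs directly into the training sketch above, e.g. `dfm_training_step(DirichletFlowCNN(), x1, optimizer)`.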