Dirichlet Flow Matching with Applications to DNA Sequence Design
Authors: Hannes Stärk, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, Tommi Jaakkola
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. [...] 4. Experiments |
| Researcher Affiliation | Academia | 1CSAIL, Massachusetts Institute of Technology; 2Dept. of Mathematics, Massachusetts Institute of Technology. |
| Pseudocode | Yes | Algorithm 1 TRAINING. [...] Algorithm 2 INFERENCE. (A training-step sketch appears below the table.) |
| Open Source Code | Yes | Code is available at https://github.com/HannesStark/dirichlet-flow-matching. |
| Open Datasets | Yes | We use a dataset of 100,000 promoter sequences with 1,024 base pairs extracted from a database of human promoters (Hon et al., 2017). [...] We evaluate on two enhancer sequence datasets from fly brain cells (Janssens et al., 2022) and from human melanoma cells (Atak et al., 2021). |
| Dataset Splits | Yes | We train for 200 epochs with a learning rate of 5 × 10^-4 and early stopping on the MSE on the validation set. [...] For the enhancer data of 104665 fly brain cell sequences (Janssens et al., 2022), we use the same split as Taskiran et al. (2023), resulting in an 83726/10505/10434 split for train/val/test. Meanwhile, for the human melanoma cell dataset of 88870 sequences (Atak et al., 2021), their split has 70892/8966/9012 sequences. |
| Hardware Specification | Yes | Computational requirements. We train on RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions using an architecture similar to DDSM (Avdeyev et al., 2023) and replacing certain normalization layers, but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow, specific libraries). |
| Experiment Setup | Yes | Toy experiments. We train all models in Figure 4 for 450,000 steps with a batch size of 512... Promoter Design. We follow the setup of Avdeyev et al. (2023) and train for 200 epochs with a learning rate of 5 × 10^-4 and early stopping on the MSE on the validation set. [...] Enhancer Design. For both evaluations... we train for 800 epochs... For inference, we use 100 integration steps. [...] Classifier-free Guidance. ...train with a conditioning ratio... of 0.7. [...] Architecture. The architecture that we use for the promoter design experiments is the same as in DDSM (Avdeyev et al., 2023). [...] The model consists of 20 layers of 1D convolutions interleaved with time embedding layers... and normalization layers. (An architecture sketch appears below the table.) |
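To make the Pseudocode row concrete, the snippet below is a minimal PyTorch sketch of a Dirichlet flow matching training step in the spirit of Algorithm 1: the clean sequence x1 is noised along the conditional path x_t ~ Dir(1 + t · one_hot(x1)) and the denoiser is trained with cross-entropy to recover x1. The `model(xt, t)` signature, the batch shapes, and the uniform sampling of t over [0, 8) are illustrative assumptions rather than the authors' exact choices; the inference-time vector field of Algorithm 2 is not reproduced here.

```python
# Minimal sketch of a Dirichlet flow matching training step (in the spirit of Algorithm 1).
# Model signature, time distribution, and shapes are assumptions; see the released code
# at https://github.com/HannesStark/dirichlet-flow-matching for the authors' exact setup.
import torch
import torch.nn.functional as F

K = 4      # alphabet size (A, C, G, T)
L = 1024   # sequence length (the promoter experiments use 1,024 bp)

def dfm_training_step(model, x1, optimizer):
    """One training step. x1: LongTensor of shape (batch, L) with values in [0, K)."""
    b = x1.shape[0]
    # Sample a time per sequence; the uniform range [0, 8) is an illustrative assumption.
    t = torch.rand(b, device=x1.device) * 8.0
    # Conditional probability path: x_t ~ Dir(1 + t * one_hot(x1)).
    alpha = torch.ones(b, L, K, device=x1.device)
    alpha = alpha + t[:, None, None] * F.one_hot(x1, K).float()
    xt = torch.distributions.Dirichlet(alpha).sample()   # simplex points, shape (b, L, K)
    # The denoiser predicts a distribution over the clean token x1 at every position.
    logits = model(xt, t)                                 # (b, L, K)
    loss = F.cross_entropy(logits.reshape(-1, K), x1.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```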
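The Experiment Setup row describes the promoter-design denoiser as 20 layers of 1D convolutions interleaved with time-embedding and normalization layers. The sketch below shows one way such a stack could look; the hidden width, kernel size, activation, and conditioning mechanism are assumptions, not the DDSM-derived architecture actually used in the paper.

```python
# Minimal sketch of a 20-layer 1D-convolutional denoiser with time conditioning and
# normalization, matching the description in the Experiment Setup row. Widths, kernel
# size, and the conditioning mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(dim)
        self.time_proj = nn.Linear(dim, dim)

    def forward(self, h, t_emb):
        # h: (batch, L, dim), t_emb: (batch, dim)
        h = h + self.time_proj(t_emb)[:, None, :]           # add time conditioning
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)    # 1D convolution over the sequence
        return torch.relu(self.norm(h))

class DirichletFlowCNN(nn.Module):
    def __init__(self, alphabet=4, dim=256, n_layers=20):
        super().__init__()
        self.embed = nn.Linear(alphabet, dim)               # simplex point -> hidden features
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(ConvBlock(dim) for _ in range(n_layers))
        self.out = nn.Linear(dim, alphabet)                 # logits over the clean tokens

    def forward(self, xt, t):
        # xt: (batch, L, alphabet) simplex-valued input, t: (batch,)
        t_emb = self.time_mlp(t[:, None])
        h = self.embed(xt)
        for block in self.blocks:
            h = block(h, t_emb)
        return self.out(h)                                  # (batch, L, alphabet)
```

A model built this way plugs directly into the training sketch above, e.g. `dfm_training_step(DirichletFlowCNN(), x1, optimizer)`.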