Dirichlet Diffusion Score Model for Biological Sequence Generation

Authors: Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, Jian Zhou

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.
Researcher Affiliation | Academia | 1Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, USA.
Pseudocode | No | The paper does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Code available at https://github.com/jzhoulab/ddsm
Open Datasets | Yes | FANTOM CAGE datasets were downloaded from https://fantom.gsc.riken.jp/5/datafiles/latest/. ... The human genome sequences are retrieved from hg38
Dataset Splits | Yes | The promoters are further split into the training, validation, and test sets based on chromosomes (chr8 and 9 for the test set, chr10 for the validation set, and all other chromosomes for the training set).
Hardware Specification | No | The paper mentions software like PyTorch but does not specify any particular hardware components such as GPU or CPU models used for running the experiments.
Software Dependencies | Yes | Table 4. Runtime of Jacobi diffusion density function computation on PyTorch 1.10.1.
Experiment Setup | Yes | The Sudoku transformer is a 20-block transformer architecture... For generation and solving Sudoku puzzles, we used Euler-Maruyama sampler... 100k steps where k is the time-dilation factor are used. ... The Promoter Designer model has a custom-designed 1D convolutional architecture. ... The training uses s = 2/(a+b) Jacobi diffusion processes with maximum time 4. For sampling from the trained model, we used Euler-Maruyama sampler with 100 steps. ... In the training set, we also introduce the same amount of random shift of up to +/- 100 bp to the sequence and transcription initiation profile simultaneously.
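
The chromosome-based split quoted under Dataset Splits can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `split_by_chromosome` and the `(chrom, payload)` record layout are hypothetical, and only the chromosome assignments (chr8 and chr9 for test, chr10 for validation, everything else for training) come from the text above.

```python
# Hypothetical helper: route each promoter record to a split by chromosome.
# Only the chromosome-to-split assignment is taken from the quoted text.
TEST_CHROMS = {"chr8", "chr9"}
VALID_CHROMS = {"chr10"}


def split_by_chromosome(records):
    """records: iterable of (chrom, payload) tuples. Returns a dict of lists."""
    splits = {"train": [], "valid": [], "test": []}
    for chrom, payload in records:
        if chrom in TEST_CHROMS:
            splits["test"].append((chrom, payload))
        elif chrom in VALID_CHROMS:
            splits["valid"].append((chrom, payload))
        else:
            # All remaining chromosomes go to the training set.
            splits["train"].append((chrom, payload))
    return splits
```

Splitting by whole chromosome keeps overlapping or shifted copies of the same promoter from landing in both the training and evaluation sets.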
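
To make the Experiment Setup row more concrete, the following is a rough sketch of an Euler-Maruyama discretization applied to a forward Jacobi diffusion. Several things here are assumptions, not statements from the text above: the SDE form dx = (s/2)[a(1 - x) - bx] dt + sqrt(s x(1 - x)) dW, the reading of the speed factor as s = 2/(a + b), the clipping to [0, 1], and the function name `jacobi_euler_maruyama`. It simulates the noising process only; the sampler described in the paper runs a reverse-time process with a trained score model, which is not reproduced here.

```python
import numpy as np


def jacobi_euler_maruyama(x0, a, b, t_max=4.0, n_steps=100, seed=0):
    """Euler-Maruyama simulation of an assumed Jacobi diffusion on [0, 1].

    Assumed SDE: dx = (s/2)[a(1-x) - b x] dt + sqrt(s x (1-x)) dW,
    with speed factor s = 2/(a+b) (an assumed reading of the quoted setup).
    """
    rng = np.random.default_rng(seed)
    s = 2.0 / (a + b)
    dt = t_max / n_steps
    x = np.array(x0, dtype=float)  # copy so the caller's array is untouched
    for _ in range(n_steps):
        drift = 0.5 * s * (a * (1.0 - x) - b * x)
        diffusion = np.sqrt(s * x * (1.0 - x))
        noise = rng.standard_normal(x.shape)
        x = x + drift * dt + diffusion * np.sqrt(dt) * noise
        # Discretization can step outside the domain; clip back (assumption).
        x = np.clip(x, 0.0, 1.0)
    return x
```

Under these assumptions the mean relaxes toward a/(a + b) at unit rate, so by time 4 samples started at x = 1 should have a mean close to that value.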
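
The random-shift augmentation mentioned at the end of the Experiment Setup row (one shift of up to +/- 100 bp applied to the sequence and the transcription initiation profile together) might look like the sketch below. The function `random_shift_crop`, the idea of cropping a fixed window out of a longer padded region, and the window length used in the usage note are illustrative assumptions; only the shared shift and its +/- 100 bp bound come from the quoted text.

```python
import numpy as np


def random_shift_crop(seq, profile, window, max_shift=100, rng=None):
    """Crop seq and profile with one shared offset in [-max_shift, max_shift].

    seq and profile are equal-length arrays covering a padded region that is
    centered on the promoter (hypothetical layout).
    """
    rng = np.random.default_rng() if rng is None else rng
    shift = int(rng.integers(-max_shift, max_shift + 1))  # upper bound exclusive
    center = len(seq) // 2
    start = center - window // 2 + shift
    if start < 0 or start + window > len(seq):
        raise ValueError("padded region too short for the requested shift")
    # The same slice is applied to both arrays so they stay aligned.
    return seq[start:start + window], profile[start:start + window], shift
```

For example, calling `random_shift_crop(seq, profile, window=1024)` on a 1400-position region returns two aligned 1024-position crops plus the shift that was drawn.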