reproducibilityindex.ai

Dirichlet Diffusion Score Model for Biological Sequence Generation

Authors: Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, Jian Zhou

ICML 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.
Researcher Affiliation	Academia	Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, USA.
Pseudocode	No	The paper does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code	Yes	Code available at https://github.com/jzhoulab/ddsm
Open Datasets	Yes	FANTOM CAGE datasets were downloaded from https://fantom.gsc.riken.jp/5/datafiles/latest/. ... The human genome sequences are retrieved from hg38
Dataset Splits	Yes	The promoters are further split into the training, validation, and test sets based on chromosomes (chr8 and 9 for the test set, chr10 for the validation set, and all other chromosomes for the training set).
Hardware Specification	No	The paper mentions software like PyTorch but does not specify any particular hardware components such as GPU or CPU models used for running the experiments.
Software Dependencies	Yes	Table 4. Runtime of Jacobi diffusion density function computation on PyTorch 1.10.1.
Experiment Setup	Yes	The Sudoku transformer is a 20-block transformer architecture... For generation and solving Sudoku puzzles, we used Euler-Maruyama sampler... 100k steps where k is the time-dilation factor are used. ... The Promoter Designer model has a custom-designed 1D convolutional architecture. ... The training uses s = 2/(a+b) Jacobi diffusion processes with maximum time 4. For sampling from the trained model, we used Euler-Maruyama sampler with 100 steps. ... In the training set, we also introduce the same amount of random shift of up to +/- 100bp to the sequence and transcription initiation profile simultaneously.
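
The chromosome-based split quoted in the Dataset Splits row can be sketched as below. This is a minimal illustration, not the authors' code: the record layout `(chrom, position, strand)`, the function names, and the example records are hypothetical. Only the chromosome assignments (chr8 and chr9 for test, chr10 for validation, everything else for training) come from the quoted text.

```python
# Minimal sketch of a chromosome-based split (assumed record layout, not the authors' code).
TEST_CHROMS = {"chr8", "chr9"}   # from the quoted split description
VALID_CHROMS = {"chr10"}         # from the quoted split description


def assign_split(chrom):
    """Return 'test', 'valid', or 'train' for a chromosome name."""
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VALID_CHROMS:
        return "valid"
    return "train"


def split_promoters(promoters):
    """Group records by split; each record is a hypothetical (chrom, position, strand) tuple."""
    splits = {"train": [], "valid": [], "test": []}
    for record in promoters:
        splits[assign_split(record[0])].append(record)
    return splits


# Hypothetical example records, for illustration only.
example = [
    ("chr1", 1000, "+"),
    ("chr8", 2000, "-"),
    ("chr10", 3000, "+"),
    ("chr9", 4000, "+"),
]
splits = split_promoters(example)
```

Splitting by whole chromosomes, rather than sampling promoters at random, keeps every held-out sequence on a chromosome the model never saw during training.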