Absorb & Escape: Overcoming Single Model Limitations in Generating Heterogeneous Genomic Sequences

Authors: Zehui Li, Yuhao Ni, Guoxuan Xia, William Beardall, Akashaditya Das, Guy-Bart Stan, Yiren Zhao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess the quality of generated sequences, we conduct extensive experiments on 15 species for conditional and unconditional DNA generation. The experiment results from motif distribution, diversity checks, and genome integration tests unequivocally show that A&E outperforms state-of-the-art AR models and DMs in genomic sequence generation.
Researcher Affiliation | Academia | Imperial College London, {zehui.li22, harry.ni21, g.xia21, william.beardall15, akashaditya.das13, g.stan, a.zhao}@imperial.ac.uk
Pseudocode | Yes | Algorithm 1 (Absorb & Escape Algorithm) and Algorithm 2 (Fast Absorb & Escape Algorithm); an illustrative sketch of the refinement loop follows the table.
Open Source Code | Yes | Code is available at the GitHub repo: https://github.com/Zehui127/Absorb-Escape
Open Datasets | Yes | To better evaluate the capability of various generative algorithms in DNA generation, we construct a dataset with 15 species from the Eukaryotic Promoter Database (EPDnew) [23]. ... EPD (Ours): 160,000 sequences (Reg. & Prot.). We include the training dataset used for producing the main results, which contains 160K DNA sequences from EPD, each 256 bp long.
Dataset Splits | No | The paper mentions a 'validation dataset' used for parameter tuning, but it does not describe how this set is split from the main EPD dataset or how large it is; no percentages or sample counts for a validation split are given.
Hardware Specification | Yes | All the models are implemented in PyTorch and trained on an NVIDIA A100-PCIE-40GB with a maximum wall time of 48 GPU hours per model; most of the models converged within the given time.
Software Dependencies | No | The paper mentions PyTorch, the Adam optimizer, and Hugging Face pretrained models, but it does not give exact version numbers for these or any other libraries.
Experiment Setup | Yes | The Adam optimizer [7] is used together with a Cosine Annealing LR [22] scheduler; the learning rate of each model is detailed in Appendix D. For DiscDiff, the VAE is trained with a learning rate of 0.0001 and the UNet with a learning rate of 0.00005; DiscDiff is trained for 600 epochs. At inference time, a DDPM [16] sampler with 1000 denoising steps is used. For the Fast A&E algorithm, T_Absorb is set to 0.80. (A sketch of this training configuration also follows the table.)
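
As referenced in the Pseudocode row, the paper provides Algorithm 1 (Absorb & Escape) and Algorithm 2 (Fast Absorb & Escape). The snippet below is a minimal, hypothetical sketch of a Fast A&E-style refinement pass, not the authors' implementation: the `ar_model.next_token_probs` interface, the greedy AR resampling, and the exact absorb/escape tests are assumptions made for illustration; only the threshold value T_Absorb = 0.80 is taken from the quoted experiment setup.

```python
# Hypothetical sketch of a Fast Absorb & Escape-style refinement pass.
# The ar_model interface and the absorb/escape tests are assumptions,
# not the authors' implementation; only t_absorb=0.80 is from the paper.
import torch


def fast_absorb_escape(dm_tokens, dm_probs, ar_model, t_absorb=0.80):
    """Refine a diffusion-model (DM) sample with an autoregressive (AR) model.

    dm_tokens: LongTensor (L,)  -- token ids sampled by the DM.
    dm_probs:  FloatTensor (L,) -- DM probability assigned to each sampled token.
    ar_model:  object exposing next_token_probs(prefix) -> FloatTensor over the vocab.
    t_absorb:  absorb threshold (reported as 0.80 for Fast A&E).
    """
    out = dm_tokens.clone()
    seq_len = out.shape[0]
    i = 0
    while i < seq_len:
        if dm_probs[i] >= t_absorb:
            i += 1                      # DM is confident here: keep its token.
            continue
        # Absorb: the AR model takes over at the low-confidence position.
        j = i
        while j < seq_len:
            ar_dist = ar_model.next_token_probs(out[:j])  # p_AR(. | refined prefix)
            ar_token = int(torch.argmax(ar_dist))
            ar_prob = ar_dist[ar_token]
            # Escape: hand control back to the DM sample once the AR model is
            # no more confident than the DM was at this position (assumed test).
            if ar_prob <= dm_probs[j]:
                break
            out[j] = ar_token
            j += 1
        i = max(j, i + 1)
    return out
```

With a trained AR promoter model supplying `next_token_probs`, this would scan a DM sample left to right, hand low-confidence regions over to the AR model, and return to the DM sample once the AR model stops being more confident than the DM.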
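
The Experiment Setup row quotes the optimizer, scheduler, learning rates, and epoch count. Below is a minimal PyTorch sketch of that configuration, under the assumption of separate Adam optimizers and cosine schedules for the VAE and UNet; the `vae` and `unet` modules and the per-epoch loop body are placeholders, not the DiscDiff architectures or training code.

```python
# Minimal sketch of the reported optimizer/scheduler configuration
# (Adam + CosineAnnealingLR, VAE lr=1e-4, UNet lr=5e-5, 600 epochs).
# Model definitions and the training-loop body are placeholders.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

vae = torch.nn.Sequential(torch.nn.Linear(256, 64))   # placeholder for the DiscDiff VAE
unet = torch.nn.Sequential(torch.nn.Linear(64, 64))   # placeholder for the DiscDiff UNet

vae_opt = Adam(vae.parameters(), lr=1e-4)    # VAE learning rate quoted as 0.0001
unet_opt = Adam(unet.parameters(), lr=5e-5)  # UNet learning rate quoted as 0.00005

num_epochs = 600                             # DiscDiff is trained for 600 epochs
vae_sched = CosineAnnealingLR(vae_opt, T_max=num_epochs)
unet_sched = CosineAnnealingLR(unet_opt, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one pass over the 160K EPD training sequences (loop body omitted) ...
    vae_sched.step()
    unet_sched.step()
```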