Absorb & Escape: Overcoming Single Model Limitations in Generating Heterogeneous Genomic Sequences
Authors: Zehui Li, Yuhao Ni, Guoxuan Xia, William Beardall, Akashaditya Das, Guy-Bart Stan, Yiren Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To assess the quality of generated sequences, we conduct extensive experiments on 15 species for conditional and unconditional DNA generation. The experiment results from motif distribution, diversity checks, and genome integration tests unequivocally show that A&E outperforms state-of-the-art AR models and DMs in genomic sequence generation. |
| Researcher Affiliation | Academia | 1Imperial College London, {zehui.li22, harry.ni21, g.xia21, william.beardall15, akashaditya.das13, g.stan, a.zhao}@imperial.ac.uk |
| Pseudocode | Yes | Algorithm 1 Absorb & Escape Algorithm; Algorithm 2 Fast Absorb & Escape Algorithm |
| Open Source Code | Yes | Code is available at the GitHub repo: https://github.com/Zehui127/Absorb-Escape |
| Open Datasets | Yes | To better evaluate the capability of various generative algorithms in DNA generation, we construct a dataset with 15 species from the Eukaryotic Promoter Database (EPDnew) [23]. ... EPD (Ours): 160,000 sequences, Reg. & Prot.; We include the training dataset used for producing the main results, which includes 160K DNA sequences from EPD, each with a length of 256 bp. |
| Dataset Splits | No | The paper mentions a 'validation dataset' for parameter tuning but does not provide specific details on how this dataset is split or its size from the main EPD dataset. For example, it does not state percentages or sample counts for a validation split. |
| Hardware Specification | Yes | All the models are implemented in PyTorch and trained on an NVIDIA A100-PCIE-40GB with a maximum wall time of 48 GPU hours per model; most of the models converged within the given time. |
| Software Dependencies | No | The paper mentions 'Pytorch' and 'Adam optimizer' and refers to 'Hugging Face' for pretrained models, but does not specify exact version numbers for these software dependencies or other libraries. |
| Experiment Setup | Yes | Adam optimizer [7] is used together with the Cosine Annealing LR [22] scheduler. The learning rates of each model are detailed in Appendix D. For DiscDiff, the VAE is trained with a learning rate of 0.0001, while the UNet is trained with a learning rate of 0.00005. DiscDiff is trained for 600 epochs; during inference, we use the DDPM [16] sampler with 1000 denoising steps. For the Fast A&E algorithm, we set T_Absorb to 0.80. (Hedged sketches of these settings appear after the table.) |
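
For readers trying to reproduce the training configuration quoted in the Experiment Setup row, the following is a minimal PyTorch sketch of the reported optimizer and scheduler settings: Adam with the Cosine Annealing LR scheduler, a learning rate of 0.0001 for the VAE and 0.00005 for the UNet, and 600 training epochs. The module definitions and the empty training-loop body are placeholders, not the paper's DiscDiff implementation.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder modules standing in for DiscDiff's VAE and UNet; only the
# hyperparameters below are taken from the reproducibility table above.
vae = torch.nn.Linear(256, 64)
unet = torch.nn.Linear(64, 64)

NUM_EPOCHS = 600  # "DiscDiff is trained for 600 epochs"

# Adam optimizers with the reported per-component learning rates
vae_opt = Adam(vae.parameters(), lr=1e-4)    # VAE: 0.0001
unet_opt = Adam(unet.parameters(), lr=5e-5)  # UNet: 0.00005

# Cosine Annealing LR schedulers annealed over the full training run
vae_sched = CosineAnnealingLR(vae_opt, T_max=NUM_EPOCHS)
unet_sched = CosineAnnealingLR(unet_opt, T_max=NUM_EPOCHS)

for epoch in range(NUM_EPOCHS):
    # ... training steps over the 160K EPD sequences would go here ...
    vae_sched.step()
    unet_sched.step()
```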
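
The T_Absorb = 0.80 threshold reported for Fast A&E can also be illustrated with a small, self-contained sketch. Only the threshold comparison is shown here; the function name `flag_absorb_regions` and the surrounding refinement step (where an AR model would regenerate the flagged positions) are assumptions for illustration, not the paper's Algorithm 2.

```python
import numpy as np

T_ABSORB = 0.80  # threshold reported in the Experiment Setup row above

def flag_absorb_regions(ar_token_probs, t_absorb=T_ABSORB):
    """Return a boolean mask of positions whose AR-model probability falls
    below t_absorb. These low-confidence regions of a diffusion-model sample
    are the candidates to be 'absorbed' and refined; the refinement itself
    is omitted in this sketch."""
    return np.asarray(ar_token_probs) < t_absorb

# Toy example: per-token probabilities an AR model might assign to a DM sample.
probs = [0.95, 0.91, 0.42, 0.37, 0.88, 0.99]
print(flag_absorb_regions(probs))  # [False False  True  True False False]
```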