Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
Authors: Lifeng Qiao, Peng Ye, Yuchen Ren, Weiqiang Bai, Chaoqi Liang, Xinzhu Ma, Nanqing Dong, Wanli Ouyang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA demonstrates superior performance to existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining. |
| Researcher Affiliation | Academia | Lifeng Qiao (1,2), Peng Ye (1,3), Yuchen Ren (1,4), Weiqiang Bai (1), Chaoqi Liang (1), Xinzhu Ma (1,3), Nanqing Dong (1), Wanli Ouyang (1,3); 1 Shanghai Artificial Intelligence Laboratory, 2 Shanghai Jiao Tong University, 3 The Chinese University of Hong Kong, 4 The University of Sydney. Contact: yepeng@pjlab.org.cn |
| Pseudocode | Yes | A.2.1 Non-Maximum Suppression: Algorithm 1, Detailed Non-Maximum Suppression for Basic Unit Placement... A.2.2 Sparse Mixture of Convolution Experts: Algorithm 2, Detailed Sparse Convolution... A.2.3 Deformable Convolution: Algorithm 3, Detailed Deformable Convolution. (A simplified NMS sketch is given after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/qiaoqiaoLF/MxDNA. |
| Open Datasets | Yes | We download the data from https://huggingface.co/spaces/InstaDeepAI/nucleotide_transformer_benchmark for Nucleotide Transformer Benchmarks and https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks for Genomic Benchmarks. |
| Dataset Splits | Yes | To ensure fair comparison, we fully finetune all the BERT-like DNA foundation models including Nucleotide Transformer v2 [6], DNABERT [4], DNABERT2 [5], and MxDNA under the same hyperparameter settings... We keep the original data splits in [6, 20]. |
| Hardware Specification | Yes | We train and evaluate the models on NVIDIA RTX 3090 and NVIDIA A100 GPUs. |
| Software Dependencies | Yes | We utilize Flash Attention [52, 53] for efficient attention calculations... PyTorch [54]... PyTorch Lightning [55]... Huggingface [56]... pybind11 [57]... Scikit-Learn [58]... Numpy [59]... Matplotlib [60]... Seaborn [61]. |
| Experiment Setup | Yes | Our MxDNA is built on the architecture of the Nucleotide Transformer v2 100M model with 512 hidden units and 22 layers, totaling approximately 100M parameters... The model's learnt tokenization module includes 10 convolution experts with kernel sizes ranging from 1 to 10, along with a deformable convolution block with a kernel size of three... pretrained on the whole Human Reference Genome [32] on the masked language modeling task [1] with 15% of the nucleotides randomly masked. An auxiliary balancing loss with a weight of 0.01 is used... The model undergoes training for 500k steps... trained with a learning rate of 1e-4 and a batch size of 512. We employ the AdamW optimizer with β1 = 0.9, β2 = 0.98, ϵ = 1e-6, a weight decay of 0.01, and a cosine annealing learning rate scheduler with a linear warm-up over the first 10% of steps... All the BERT-like models are fully finetuned with a batch size of 32 and a learning rate of 3e-5. We employ the AdamW optimizer with β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and a weight decay of 0.01. Models are trained for 10 epochs on Genomic Benchmarks and 20 epochs on Nucleotide Transformer Benchmarks... (An illustrative optimizer/scheduler sketch follows the table.) |
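
For readers who want a concrete picture of the non-maximum suppression step referenced in the Pseudocode row, here is a minimal sketch in PyTorch. It assumes each convolution expert proposes candidate basic units as (center, width, score) triples over a 1-D nucleotide sequence; the function name `nms_basic_units` and the exact overlap rule are illustrative assumptions, not taken from the MxDNA repository.

```python
# Minimal sketch of non-maximum suppression over candidate basic units on a
# 1-D nucleotide sequence. Candidate units are (center, width, score) triples;
# the suppression rule (reject any overlap with an already-kept unit) is an
# assumption for illustration, not the authors' exact algorithm.
import torch


def nms_basic_units(centers: torch.Tensor,
                    widths: torch.Tensor,
                    scores: torch.Tensor) -> torch.Tensor:
    """Greedily keep the highest-scoring units whose spans do not overlap.

    centers, widths, scores: 1-D tensors of equal length.
    Returns the indices of the kept units, in descending score order.
    """
    starts = centers - widths.float() / 2
    ends = centers + widths.float() / 2
    order = torch.argsort(scores, descending=True)
    keep = []
    for idx in order.tolist():
        overlaps = any(
            not (ends[idx] <= starts[k] or starts[idx] >= ends[k])
            for k in keep
        )
        if not overlaps:
            keep.append(idx)
    return torch.tensor(keep, dtype=torch.long)


# Toy usage: five candidate units over a short sequence.
centers = torch.tensor([2.0, 3.0, 7.0, 8.0, 12.0])
widths = torch.tensor([3, 3, 4, 2, 5])
scores = torch.tensor([0.9, 0.4, 0.8, 0.7, 0.5])
print(nms_basic_units(centers, widths, scores))  # tensor([0, 2, 4])
```

The greedy, score-ordered selection mirrors standard 1-D NMS; the paper's Algorithm 1 should be consulted for the exact placement and suppression criteria used in MxDNA.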
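
The pretraining optimizer and scheduler settings quoted in the Experiment Setup row (AdamW with β1 = 0.9, β2 = 0.98, ϵ = 1e-6, weight decay 0.01, cosine annealing with linear warm-up over the first 10% of 500k steps) can be expressed in a short PyTorch snippet. This is a hedged sketch: `model` is a stand-in module and the schedule is implemented via `LambdaLR`, which may differ from the scheduler class the authors actually used.

```python
# Sketch of the reported pretraining optimizer/scheduler configuration.
# `model` and `total_steps` are placeholders; only the hyperparameter values
# come from the paper's experiment setup.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)       # stand-in for the ~100M-parameter model
total_steps = 500_000                   # 500k pretraining steps
warmup_steps = int(0.10 * total_steps)  # linear warm-up over first 10% of steps

optimizer = AdamW(model.parameters(), lr=1e-4,
                  betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)


def lr_lambda(step: int) -> float:
    """Linear warm-up followed by cosine annealing to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = LambdaLR(optimizer, lr_lambda)

# In the training loop, call optimizer.step() followed by scheduler.step()
# once per optimization step.
```

For finetuning, the table reports the same optimizer family with β2 = 0.999 and ϵ = 1e-8, a learning rate of 3e-5, and a batch size of 32; swapping those values into the snippet above reproduces that configuration.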