Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Authors: Lifeng Qiao, Peng Ye, Yuchen Ren, Weiqiang Bai, Chaoqi Liang, Xinzhu Ma, Nanqing Dong, Wanli Ouyang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA demonstrates superior performance to existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining.
Researcher Affiliation | Academia | Lifeng Qiao1,2, Peng Ye1,3, Yuchen Ren1,4, Weiqiang Bai1, Chaoqi Liang1, Xinzhu Ma1,3, Nanqing Dong1, Wanli Ouyang1,3. 1Shanghai Artificial Intelligence Laboratory, 2Shanghai Jiao Tong University, 3The Chinese University of Hong Kong, 4The University of Sydney. yepeng@pjlab.org.cn
Pseudocode | Yes | A.2.1 Non-Maximum Suppression: Algorithm 1 Detailed Non-Maximum Suppression for Basic Unit Placement... A.2.2 Sparse Mixture of Convolution Experts: Algorithm 2 Detailed Sparse Convolution... A.2.3 Deformable Convolution: Algorithm 3 Detailed Deformable Convolution (an illustrative sketch of the mixture-of-convolution-experts idea follows the table)
Open Source Code | Yes | Code is available at https://github.com/qiaoqiaoLF/MxDNA.
Open Datasets | Yes | We download the data from https://huggingface.co/spaces/InstaDeepAI/nucleotide_transformer_benchmark for Nucleotide Transformer Benchmarks and https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks for Genomic Benchmarks. (A download sketch follows the table.)
Dataset Splits | Yes | To ensure fair comparison, we fully finetune all the BERT-like DNA foundation models, including Nucleotide Transformer v2 [6], DNABERT [4], DNABERT2 [5], and MxDNA, under the same hyperparameter settings... We keep the original data splits in [6, 20].
Hardware Specification | Yes | We train and evaluate the models on NVIDIA RTX 3090 and NVIDIA A100 GPUs.
Software Dependencies | Yes | We utilize Flash Attention [52, 53] for efficient attention calculations... PyTorch [54]... PyTorch Lightning [55]... Huggingface [56]... pybind11 [57]... Scikit-Learn [58]... NumPy [59]... Matplotlib [60]... Seaborn [61].
Experiment Setup | Yes | Our MxDNA is built on the Nucleotide Transformer v2 100M architecture with 512 hidden units and 22 layers, totaling approximately 100M parameters... The model's learnt tokenization module includes 10 convolution experts with kernel sizes ranging from 1 to 10, along with a deformable convolution block with a kernel size of three... pretrained on the whole Human Reference Genome [32] on the masked language modeling task [1] with 15% of the nucleotides randomly masked. An auxiliary balancing loss with a weight of 0.01 is used... The model undergoes training for 500k steps... trained with a learning rate of 1e-4 and a batch size of 512. We employ the AdamW optimizer with β1 = 0.9, β2 = 0.98, ϵ = 1e-6, a weight decay of 0.01, and a cosine annealing learning rate scheduler with a linear warm-up over the first 10% of steps... All the BERT-like models are fully finetuned with a batch size of 32 and a learning rate of 3e-5. We employ the AdamW optimizer with β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and a weight decay of 0.01. Models are trained for 10 epochs on Genomic Benchmarks and 20 epochs on Nucleotide Transformer Benchmarks... (An optimizer and scheduler sketch follows the table.)
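
The pseudocode row above lists a sparse mixture of convolution experts among the appendix algorithms. The following is a rough illustrative sketch of that general idea only, not the authors' implementation: each position is routed to one of ten Conv1d experts (kernel sizes 1 to 10) with top-1 gating. All class names, shapes, and the gating scheme here are assumptions.

```python
# Illustrative sketch (NOT the authors' code): a sparse mixture of 1D convolution
# experts with kernel sizes 1..10 and per-position top-1 routing. Names, shapes,
# and the gating scheme are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConvExperts(nn.Module):
    def __init__(self, hidden: int = 512, num_experts: int = 10):
        super().__init__()
        # One Conv1d expert per kernel size 1..num_experts; 'same' padding keeps sequence length.
        self.experts = nn.ModuleList(
            nn.Conv1d(hidden, hidden, kernel_size=k, padding="same")
            for k in range(1, num_experts + 1)
        )
        self.gate = nn.Linear(hidden, num_experts)  # per-position routing logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        probs = F.softmax(self.gate(x), dim=-1)     # (B, L, E)
        top_p, top_idx = probs.max(dim=-1)          # top-1 expert per position
        x_c = x.transpose(1, 2)                     # (B, hidden, L) for Conv1d
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = (top_idx == e).unsqueeze(-1).float()  # positions routed to expert e
            # Dense convolution, sparse combination; a real implementation would
            # compute only the routed positions.
            y = expert(x_c).transpose(1, 2)
            out = out + routed * top_p.unsqueeze(-1) * y
        return out

# Usage: SparseConvExperts()(torch.randn(2, 128, 512)) -> tensor of shape (2, 128, 512)
```

In a routed setup like this, an auxiliary balancing loss (the paper reports a weight of 0.01) is typically added to keep expert usage from collapsing onto a few kernel sizes.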
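For the open-datasets row, one way to pull both benchmark sources locally is sketched below. The repository IDs are taken from the URLs quoted above; the download method (snapshot_download plus git clone) and local paths are workflow assumptions, not the authors' scripts.

```python
# Hedged sketch: fetch the two benchmark sources cited above. Repo IDs come from
# the quoted URLs; local paths and the download method are assumptions.
import subprocess
from huggingface_hub import snapshot_download

# Nucleotide Transformer Benchmarks (hosted as a Hugging Face Space).
nt_dir = snapshot_download(
    repo_id="InstaDeepAI/nucleotide_transformer_benchmark",
    repo_type="space",
)
print("Nucleotide Transformer benchmark files at:", nt_dir)

# Genomic Benchmarks (distributed via the GitHub repository).
subprocess.run(
    ["git", "clone",
     "https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.git"],
    check=True,
)
```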
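Finally, the reported pretraining optimization settings map directly onto standard PyTorch/Transformers components. This is a minimal sketch using the values from the experiment-setup row; `model` is a stand-in placeholder, and wiring it to the actual MxDNA architecture is out of scope.

```python
# Minimal sketch of the reported pretraining optimization setup; hyperparameter
# values are copied from the excerpt above, while `model` is a placeholder.
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 500_000                    # 500k pretraining steps
warmup_steps = int(0.10 * total_steps)   # linear warm-up over the first 10% of steps

model = torch.nn.Linear(512, 512)        # stand-in for the ~100M-parameter MxDNA model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                             # pretraining learning rate (batch size 512)
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# Finetuning reportedly reuses AdamW with lr=3e-5, betas=(0.9, 0.999), eps=1e-8,
# weight_decay=0.01, batch size 32, for 10 (Genomic Benchmarks) or
# 20 (Nucleotide Transformer Benchmarks) epochs.
```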