Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Authors: Yair Schiff, Chia Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bidirectionality or equivariance. Code to reproduce our experiments is available here.
Researcher Affiliation | Academia | 1 Department of Computer Science, Cornell University, New York, NY USA; 2 Department of Computer Science, Princeton University, Princeton, NJ USA; 3 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA USA.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It provides mathematical formulations and descriptions of its components, but not in pseudocode format.
Open Source Code | Yes | Code to reproduce our experiments is available here.
Open Datasets | Yes | Data: We limit the focus of this work to human-genome related tasks. To that end, we perform all pre-training tasks on the human reference genome (Consortium et al., 2009). ... The dataset used in this task is derived from the Enformer paper (Avsec et al., 2021) and presented in Trop et al. (2023).
Dataset Splits | Yes | We perform 5-fold cross-validation (CV) using different random seeds, with early stopping on validation accuracy, and report the mean along with the max/min across the 5 seeds. ... We additionally follow Dalla-Torre et al. (2023) in performing 10-fold CV using different random seeds with early stopping on the validation metric. (A minimal sketch of this seed-wise reporting appears after the table.)
Hardware Specification | Yes | Model training and inference were run on GPUs, with the number of devices and machine type varying by model size during pre-training and downstream tasks. We use 3090, A5000, A6000, V100, and A100 GPUs.
Software Dependencies | No | In Table 8, the paper lists the software libraries used (e.g., PyTorch, NumPy, scikit-learn), but it does not provide specific version numbers for these libraries, only the publication year of their respective papers, which is insufficient for reproducibility.
Experiment Setup | Yes | All the Mamba-based models, including Caduceus, were trained with a learning rate of 8e-3. We maintain a constant number of tokens in each batch, using 2^20 tokens per batch. For example, for sequence lengths of 1,024, the batch size is also 1,024, and for sequence lengths of 131k (2^17), the batch size is 8. All our models, other than Caduceus-PS, are pre-trained with RC data augmentation, where any given sequence is either unchanged or has the RC operation applied to it with equal probability. Models were trained with cosine decay and the ADAM optimization algorithm (Kingma & Ba, 2014), with β1 and β2 values of 0.95 and 0.9, respectively.
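
The constant-token batching and optimizer settings quoted in the Experiment Setup row can be expressed in a few lines of PyTorch. This is a minimal sketch under the hyperparameters quoted above (learning rate 8e-3, betas (0.95, 0.9) as stated, cosine decay), not the authors' training code: the model is a placeholder and the scheduler horizon T_max is an assumed value, not one reported in the paper.

```python
import torch

TOKENS_PER_BATCH = 2 ** 20  # constant token budget per batch, per the setup above

def batch_size_for(seq_len: int) -> int:
    """Batch size that keeps the number of tokens per batch fixed at 2^20."""
    return TOKENS_PER_BATCH // seq_len

assert batch_size_for(1024) == 1024   # 1,024-length sequences -> batch size 1,024
assert batch_size_for(2 ** 17) == 8   # ~131k-length sequences -> batch size 8

model = torch.nn.Linear(16, 16)       # placeholder for the actual sequence model
optimizer = torch.optim.Adam(model.parameters(), lr=8e-3, betas=(0.95, 0.9))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)  # T_max assumed
```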
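The RC (reverse-complement) data augmentation described in the same row, where each training sequence is either left unchanged or reverse-complemented with equal probability, can be sketched as follows. The string-based encoding and function names are illustrative assumptions, not the repository's implementation.

```python
import random

# Complement table for DNA characters (illustrative; 'N' maps to itself).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def rc_augment(seq: str, p: float = 0.5) -> str:
    """With probability p, replace the sequence with its reverse complement;
    otherwise leave it unchanged."""
    return reverse_complement(seq) if random.random() < p else seq

print(rc_augment("ACGTTN"))  # either 'ACGTTN' or its reverse complement 'NAACGT'
```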
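The Dataset Splits row describes repeating each downstream run over several random seeds with early stopping and reporting the mean together with the max/min spread. The sketch below illustrates only that reporting loop; run_fold is a hypothetical stand-in for one full training run with early stopping on validation accuracy, not the authors' code.

```python
import numpy as np

def run_fold(seed: int) -> float:
    """Hypothetical stand-in for one training run: fix the random seed,
    train with early stopping on validation accuracy, return the test metric."""
    rng = np.random.default_rng(seed)      # placeholder for actual training
    return float(rng.uniform(0.7, 0.9))    # placeholder metric value

# 5-fold CV over random seeds, reporting the mean and the max/min spread.
scores = [run_fold(seed) for seed in range(5)]
print(f"mean={np.mean(scores):.3f}  min={min(scores):.3f}  max={max(scores):.3f}")
```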