Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences
Authors: Niklas Schmidinger, Lisa Schneckenreiter, Philipp Seidl, Johannes Schimunek, Pieter-Jan Hoedt, Johannes Brandstetter, Andreas Mayr, Sohvi Luukkonen, Sepp Hochreiter, Günter Klambauer
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in three large domains (genomics, proteins, and chemistry) were performed to assess xLSTM's ability to model biological and chemical sequences. The results show that models based on Bio-xLSTM a) can serve as proficient generative models for DNA, protein, and chemical sequences, b) learn rich representations for those modalities, and c) can perform in-context learning for proteins and small molecules. Section 4, titled 'EXPERIMENTS AND RESULTS', details these experiments across DNA, protein, and chemical sequences, including performance metrics like Validation Loss, Perplexity, and FCD, often presented in figures and tables. |
| Researcher Affiliation | Collaboration | (1) ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria; (2) NXAI GmbH, Linz, Austria |
| Pseudocode | No | The paper describes the sLSTM and mLSTM architectures using mathematical equations in Sections 2.1 and 2.2, and discusses block structures in Section 2.3. Figure A1 depicts block diagrams. Appendix B.2 provides equations for parallel and chunkwise formulations. However, there are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor are there structured, code-like formatted procedures presented as a formal algorithm. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT: To ensure reproducibility and facilitate future research, we provide three standalone code repositories for DNA-xLSTM, Prot-xLSTM, and Chem-xLSTM, each containing the respective implementations, training scripts, evaluation procedures, and pre-processed datasets. |
| Open Datasets | Yes | In creating these models, we have taken care to train exclusively on publicly available data, such as the human reference genome, OpenProteinSet, and publicly available small molecule databases. The training data for DNA-xLSTM models was sourced from the human reference genome (Church et al., 2011). Training data was sourced from the filtered OpenProteinSet (Ahdritz et al., 2023). Dataset derived from ChEMBL with a context length of 100 tokens. We consider natural products as a domain and utilize COCONUT (Chandrasekhar et al., 2024) as the source dataset. All from the Probes & Drugs portal (Skuta et al., 2017). Product molecules from the reaction dataset USPTO-50k (Lowe, 2012). The domains bio, diversity, green, yellow, orange, and red, from ZINClick (Levré et al., 2018). Active molecules from the domains BACE, BBBP, Clintox, HIV, SIDER, Tox21, Tox21-10k, and Toxcast from MoleculeNet (Wu et al., 2018). Active molecules from 95 bioassays from FS-MOL (Stanley et al., 2021). Active molecules from 109 bioassays from PubChem (Kim et al., 2023). A subset of active molecules from the BELKA challenge (Quigley et al., 2024). |
| Dataset Splits | Yes | We also use the train, validation (192 clusters), and test (500 clusters) split provided by ProtMamba. All models are trained to generate molecules as SMILES strings (Weininger, 1988) using a CLM paradigm. The dataset used in (Özçelik et al., 2024) is derived from ChEMBL with a random split into 1.9M training, 100k validation, and 23k test molecules. The final dataset is split at 8:1:1 into train-, validation-, and test-domains, sorted by their character length in descending order. For the Genomic benchmark, we perform five randomly seeded train-validation splits, fine-tune models for 10 epochs, and use early-stopping on validation performance. Final test results are reported as the mean performance with max/min over the 5 seeds on a held-out test set. For the Nucleotide Transformer tasks, we use 20 epochs and 10 seeds. |
| Hardware Specification | Yes | The experiments were conducted on multiple GPU servers with A100 GPUs. Model training was performed in both single-node and multi-node setups, utilizing 1–8 A100 GPUs per node. Prot-xLSTM-102M training with a context length of 2^18 was completed on a node with 8 H200 GPUs. The largest models were trained across up to four nodes using distributed data parallelism. Some experiments leveraged compute resources provided by EuroHPC Joint Undertaking clusters, including Karolina at IT4Innovations, Leonardo at CINECA, and MeluXina at LuxProvide. |
| Software Dependencies | No | The paper mentions several software components and frameworks, such as 'PyTorch' (for LSTM), 'GPT-2' (based on the Transformer architecture), 'S4 model with the implementation from Gu et al. (2022)', 'Mamba model, using the official repository provided with (Gu and Dao, 2024)', 'Adam optimizer (Kingma and Ba, 2015)', and 'RDKit (Landrum, 2013)'. However, no specific version numbers are provided for these software dependencies, making it difficult to precisely reproduce the experimental environment. |
| Experiment Setup | Yes | For the DNA domain, we propose the DNA-xLSTM architecture to enhance sequence modeling capabilities, particularly for varying context lengths. We introduce three model configurations based on DNA-xLSTM: two sLSTM-based configurations trained with a context window of 1,024 tokens (DNA-xLSTM-500k and DNA-xLSTM-2M), and an mLSTM-based configuration trained with a context window of 32,768 tokens (DNA-xLSTM-4M). The short-context configuration, DNA-xLSTM-500k, has an embedding dimension of 128, 5 sLSTM blocks, an up-projection ratio of 1.25 to match the baseline model parameter count, and a total parameter count of 500k, while DNA-xLSTM-2M has an embedding dimension of 256, 6 sLSTM blocks, a 1.0 up-projection ratio, and 2M parameters. The long-context configuration, DNA-xLSTM-4M, has an embedding dimension of 256, 9 mLSTM blocks, a 2.0 up-projection ratio, and is augmented with Rotary Position Encodings (RoPE) (Su et al., 2024a) to handle long-range dependencies effectively, with a total of 4M parameters. All three configurations are trained with both CLM and MLM. Appendix Tables A3, A4, A6, A8, and A13 provide detailed hyperparameters including embedding dimensions, number of blocks/layers, kernel sizes, number of heads, up-projection ratios, bidirectionality settings, norm biases, QKV projection block sizes, m/sLSTM ratios, context lengths, position embedding types, optimizers (AdamW, β = (0.9, 0.95)), learning rates (e.g., 6e-3, 8e-3, 1e-2), learning rate schedules (Cosine Decay), warmup steps (1,000), weight decay (0.1), dropout (0), batch sizes (e.g., 1,024, 32), and update steps (10,000). For fine-tuning, batch sizes {64, 128, 256, 512} and learning rates {4e-4, 6e-4, 8e-4, 1e-3, 2e-3} were searched. |
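For reference, the three DNA-xLSTM configurations and the shared optimizer settings quoted in the Experiment Setup row can be collected into a small configuration sketch. The dictionary keys are illustrative (not taken from the authors' code); the values are those reported in the excerpt.

```python
# Hedged sketch: key names are our own; values are as reported in the paper excerpt.
DNA_XLSTM_CONFIGS = {
    "DNA-xLSTM-500k": {
        "embedding_dim": 128,
        "num_blocks": 5,
        "block_type": "sLSTM",
        "up_projection_ratio": 1.25,
        "context_length": 1024,
    },
    "DNA-xLSTM-2M": {
        "embedding_dim": 256,
        "num_blocks": 6,
        "block_type": "sLSTM",
        "up_projection_ratio": 1.0,
        "context_length": 1024,
    },
    "DNA-xLSTM-4M": {
        "embedding_dim": 256,
        "num_blocks": 9,
        "block_type": "mLSTM",
        "up_projection_ratio": 2.0,
        "context_length": 32768,
        "position_encoding": "RoPE",
    },
}

# Shared training settings reported in the appendix tables.
OPTIMIZER = {
    "name": "AdamW",
    "betas": (0.9, 0.95),
    "lr_schedule": "cosine_decay",
    "warmup_steps": 1000,
    "weight_decay": 0.1,
    "dropout": 0.0,
}
```

Collecting the values this way makes it easy to spot the short- vs. long-context distinction: both sLSTM variants use a 1,024-token window, while the mLSTM variant pairs a 32,768-token window with RoPE.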
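The chemistry split described in the Dataset Splits row (8:1:1 into train-, validation-, and test-domains, sorted by character length in descending order) can be sketched as follows. The function name and exact slicing mechanics are illustrative assumptions, not the authors' code.

```python
def split_by_length(domains, ratios=(0.8, 0.1, 0.1)):
    """Illustrative 8:1:1 split after sorting by character length, descending.

    `domains` is a list of strings (e.g., SMILES domains); the first 80% of the
    length-ordered list becomes train, the next 10% validation, the rest test.
    """
    ordered = sorted(domains, key=len, reverse=True)
    n = len(ordered)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (
        ordered[:n_train],                  # train-domains (longest entries)
        ordered[n_train:n_train + n_val],   # validation-domains
        ordered[n_train + n_val:],          # test-domains (shortest entries)
    )

# Usage: ten toy "domains" of lengths 1..10 split into 8 / 1 / 1.
train, val, test = split_by_length(["a" * i for i in range(1, 11)])
```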