Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model

Authors: Dongki Kim, Wonbin Lee, Sung Ju Hwang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and providing informative responses, implying its potential as a general-purpose assistant for molecular analysis. Our project page is at https://mol-llama.github.io/.
Researcher Affiliation	Collaboration	Dongki Kim¹, Wonbin Lee¹, Sung Ju Hwang¹,² (KAIST¹, DeepAuto.ai²)
Pseudocode	No	The paper describes methods and processes in text and flow diagrams (like Figure 1 and Figure 3), but does not contain a dedicated pseudocode or algorithm block.
Open Source Code	Yes	Our project page is at https://mol-llama.github.io/.
Open Datasets	Yes	We refer to the constructed instruction dataset as Mol-LLaMA-Instruct. We note that our dataset not only aids in understanding the molecular features but also enhances explainability and reasoning capabilities by extensively addressing fundamental molecular features and various types of interactions between users and an assistant. For more details on the dataset construction, please refer to Appendix B.1. To ensure the quality of instruction-following samples, we further filter out factually incorrect ones. Inspired by LLM-as-a-judge [62], we use GPT-4o to evaluate the factual accuracy of the samples and select those with correct content, establishing 284k instruction-following samples from the training set of the PubChem 324k dataset [25]. To evaluate the factual accuracy in the molecular comprehension, we employ MoleculeQA benchmark [33].
Dataset Splits	Yes	Experimental Setting: To assess the effectiveness of learning the general knowledge, we perform zero-shot evaluation on the PAMPA task [47]. The task is classifying the permeability of artificial membranes, requiring an understanding of essential molecular properties such as lipophilicity and molecular size. To evaluate the ability to handle diverse requests, we test on two additional prompting settings: 1) CoT [51] that instructs to provide rationales while answering and 2) prompting with task-specific information (w/ Task Info). Detailed evaluation settings are provided in Appendix C.2. We first generate the molecular conformations using RDKit and Open Babel, then split the train, valid, and test datasets using the predefined random splitting from the TDC benchmark [47].
Hardware Specification	Yes	Resources: We train Mol-LLaMA on NVIDIA H100 and NVIDIA A100 80GB.
Software Dependencies	Yes	The optimizer is AdamW optimizer [31] with a weight decay of 0.05 and a cosine scheduler with 1000 steps of linear warmup where the peak and minimal learning rates are 1e-4 and 5e-6. The number of query tokens is 8 and the batch size is 256. We leverage LoRA [19] where the rank (r) is 8, α is 32, and the dropout ratio is 0.1. We use the same optimizer configuration in the molecular representation learning stage, while training for 10 epochs with 128 batch sizes.
Experiment Setup	Yes	In the first stage, we train the blending module and the Q-Former while freezing the 2D and 3D encoders. We adopt the multi-objectives to align the molecular embeddings to the molecule-relevant texts, including molecule-text contrastive learning, molecule-text matching, and molecule-grounded text generation [24, 25]. We opt to use the IUPAC name as the molecule-relevant texts instead of using descriptions. Please refer to Section B.2 for a detailed explanation of molecular representation learning. End-to-end Instruction Tuning: As shown in Fig. 1, we jointly train the blending module, Q-Former, and an LLM via the multi-modal instruction tuning, while freezing the 2D and 3D encoders. We instruction-tune LLMs on the proposed instruction dataset, employing LoRA [19] for the training efficiency. For the details of the instruction tuning of Mol-LLaMA, please refer Section B.2. We leverage LoRA [19] where the rank (r) is 8, α is 32, and the dropout ratio is 0.1. We use the same optimizer configuration in the molecular representation learning stage, while training for 10 epochs with 128 batch sizes. We first generate the molecular conformations using RDKit and Open Babel. Then, we fine-tune molecular LLMs including Mol-LLaMA, 3D-MoLM, Mol-Instructions, and LLaMo on the training dataset in MoleculeQA benchmark for 20 epochs, where the total batch size is set to 256 with gradient accumulation, the learning rate is fixed to 1e-4, and the weight decay is set to 0.05 with AdamW [31] optimizer. The fine-tuned models are evaluated on the greedy decoding strategies on the test datasets.
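
The optimizer settings quoted in the Software Dependencies row (1000 steps of linear warmup, then a cosine schedule between a peak of 1e-4 and a minimum of 5e-6) can be illustrated with a small sketch. This is not the authors' code: the function name `lr_at_step`, the `total_steps` argument, and the choice to start warmup from zero are assumptions made for illustration, since the quoted text does not state the total number of training steps or the warmup starting value.

```python
import math


def lr_at_step(step, total_steps, warmup_steps=1000, peak_lr=1e-4, min_lr=5e-6):
    """Learning rate with linear warmup followed by cosine decay.

    warmup_steps, peak_lr, and min_lr default to the values quoted above.
    total_steps is a hypothetical parameter; the quoted text does not give it.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Assumption: warmup rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Cosine decay from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Example with a hypothetical budget of 10,000 training steps.
    for s in (0, 500, 1000, 5500, 10000):
        print(s, lr_at_step(s, total_steps=10000))
```

In a PyTorch training loop, a function like this would typically be wrapped in a per-step scheduler around an AdamW optimizer with weight decay 0.05; that wiring is omitted here to keep the sketch dependency-free.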