Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

Authors: Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, Huajun Chen

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess the real-world effectiveness of Mol-Instructions, we conduct an extensive series of evaluations. Employing the representative LLM as the foundational model, we perform instruction tuning for each of the three main categories of instructions. The results highlight the value of Mol-Instructions, demonstrating its ability to enhance the versatility and understanding of large models in the complex domain of biomolecular studies.
Researcher Affiliation | Academia | College of Computer Science and Technology, Zhejiang University; ZJU-Ant Group Joint Research Center for Knowledge Graphs, Zhejiang University; ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University. {fangyin, liangxiaozhuan, kangweiliu, hrhr, zhuo.chen, fanxh}@zju.edu.cn, {zhangningyu, huajunsir}@zju.edu.cn
Pseudocode | No | The paper describes processes and procedures in textual form and through diagrams, but it does not include formal pseudocode blocks or algorithms.
Open Source Code | Yes | All data, code, and model weights can be found on GitHub and Hugging Face. For a detailed description of the dataset construction process, please refer to Appendix B. (See the hedged data-loading sketch after the table.)
Open Datasets | Yes | To address this pressing need in the biomolecular domain, we introduce Mol-Instructions (CC BY-NC-SA 4.0), a dataset tailored to the unique challenges of biomolecular studies. This dataset, as delineated in Figure 1, is structured around three core components. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability. Our dataset and associated models are securely hosted on GitHub and Hugging Face, which are well-recognized platforms for managing open-source projects. They provide vast accessibility and efficient management of extensive data and code repositories, guaranteeing unobstructed access to all potential users. Table 5: Data resources and licenses involved in our paper.
Dataset Splits | Yes | Our dataset is partitioned into training, validation, and testing subsets. The training and validation sets are used for instruction tuning, while the test set assesses model performance. The remaining samples were divided into training and validation sets at an 8:2 ratio. (See the minimal split sketch after the table.)
Hardware Specification | Yes | We conduct the LoRA training and generation on 32GB V100 GPUs while performing the full-model finetuning on the 80GB A800 GPUs.
Software Dependencies | No | The paper mentions several tools and libraries such as RDKit, MMseqs, and gpt-3.5-turbo with citations, but it does not provide specific version numbers for these or other key software components, which would be required for a reproducible description of the ancillary software. (See the version-recording sketch after the table.)
Experiment Setup | Yes | The exact hyperparameters we tune for each model are shown in Table 9. Table 9: Training hyperparameters for finetuning on different datasets. QV: two linear transformation matrices on the query and value states in the self-attention module. (See the hedged LoRA configuration sketch after the table.)
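
For the Open Source Code and Open Datasets rows, the following is a minimal sketch of pulling the released data with the Hugging Face datasets library. The repository id "zjunlp/Mol-Instructions" and the configuration name are assumptions made for illustration; the report only states that the data is hosted on GitHub and Hugging Face, so check the project pages for the exact identifiers.

    # Hedged sketch: the repository id and configuration name are assumptions,
    # not values taken from the paper text quoted above.
    from datasets import load_dataset

    dataset = load_dataset(
        "zjunlp/Mol-Instructions",              # assumed repository id
        name="Molecule-oriented Instructions",  # assumed configuration name
    )

    # Records in instruction datasets of this kind typically carry
    # instruction / input / output fields; print one sample to confirm.
    first_split = next(iter(dataset))
    print(dataset[first_split][0])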
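
The Dataset Splits row reports an 8:2 train/validation split of the remaining samples. The sketch below shows one straightforward way to reproduce such a split; the fixed seed and the record layout are illustrative choices, not values taken from the paper.

    import random

    def split_train_valid(records, train_fraction=0.8, seed=42):
        # Shuffle a copy so the caller's list is untouched, then cut at 80%.
        rng = random.Random(seed)  # fixed seed is an illustrative choice
        shuffled = list(records)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # Example with placeholder instruction records.
    records = [{"instruction": f"task {i}", "input": "", "output": ""} for i in range(10)]
    train_set, valid_set = split_train_valid(records)
    print(len(train_set), len(valid_set))  # 8 2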
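
The Software Dependencies row notes that no version numbers are reported. One lightweight way to close that gap when rerunning the experiments is to record the installed versions of the Python-side packages; the package list below is an assumption based on the tools named in the paper and a typical finetuning stack, not a list given by the authors.

    from importlib.metadata import PackageNotFoundError, version

    # Assumed dependency list; adjust to the actual environment.
    packages = ["rdkit", "torch", "transformers", "peft", "datasets"]

    for pkg in packages:
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")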
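
The Hardware Specification and Experiment Setup rows mention LoRA training with adapters on the query and value projections ("QV"). The sketch below shows what such a configuration can look like with the Hugging Face peft and transformers libraries; the base checkpoint name, rank, alpha, and dropout are placeholders rather than the values from Table 9, and the module names q_proj/v_proj assume a LLaMA-style architecture.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("your-base-llm")  # placeholder checkpoint

    lora_config = LoraConfig(
        r=8,                                  # placeholder rank
        lora_alpha=16,                        # placeholder scaling factor
        lora_dropout=0.05,                    # placeholder dropout
        target_modules=["q_proj", "v_proj"],  # the query/value matrices ("QV")
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # only the LoRA adapter weights remain trainable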