LLaMo: Large Language Model-based Molecular Graph Assistant
Authors: Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that LLaMo shows the best performance on diverse tasks, such as molecular description generation, property prediction, and IUPAC name prediction. |
| Researcher Affiliation | Academia | Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim; Department of Computer Science and Engineering, Korea University; {lpmn678, bms2002, ikodoh, hyunwoojkim}@korea.ac.kr |
| Pseudocode | No | The paper describes methods in prose and figures but does not include pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code of LLaMo is available at https://github.com/mlvlab/LLaMo. |
| Open Datasets | Yes | For molecule description generation and property prediction, we use the datasets derived from PubChem and QM9 of MoleculeNet [64] as in Mol-Instructions [48]. For IUPAC name prediction, a dataset derived from [3] is used. To train the generalist variant of LLaMo, we use a training split of molecular description generation dataset of Mol-Instructions in stage 1. In stage 2, the model is instruction-tuned with a training split of description generation and property prediction instruction dataset of Mol-Instructions, IUPAC name prediction from [3], and our GPT-generated instruction-following data. [...] PubChem324k is constructed by collecting 324k molecules and their associated text information from the PubChem database. ChEBI-20 is the most commonly utilized benchmark in this task, consisting of 33,010 selected pairs of molecules and descriptions from ChEBI [72]. |
| Dataset Splits | Yes | To train the generalist variant of LLaMo, we use a training split of molecular description generation dataset of Mol-Instructions [48] in stage 1. In stage 2, the model is instruction-tuned with a training split of description generation and property prediction instruction dataset of Mol-Instructions, IUPAC name prediction from [3], and our GPT-generated instruction-following data. For the evaluation of molecular description generation and property question answering tasks, we use the test split of Mol-Instructions molecular description generation and property prediction datasets, which are sampled from PubChem [44] and the QM9 dataset of MoleculeNet [64], respectively. |
| Hardware Specification | Yes | Our experiments are run on 4 A6000 GPUs or 4 V100 GPUs for LLaMA2, and on 2 A6000 GPUs for Galactica. |
| Software Dependencies | No | The paper mentions software such as PyTorch, PyTorch Geometric, Huggingface Transformers, PEFT, and OpenDelta, but does not specify their version numbers. |
| Experiment Setup | Yes | In stage 1, the AdamW [63] optimizer is adopted with an initial learning rate of 1e-4 (minimum learning rate is 1e-5 and warmup learning rate is 1e-6). The warmup step is 1,000 and the cosine scheduler is applied. In stage 2, the initial learning rate is set to 5e-5 (minimum learning rate is 5e-6 and warmup learning rate is 5e-7). [...] We use LoRA to train the large language model in stage 2. |
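
For concreteness, below is a minimal sketch of the optimization schedule quoted in the Experiment Setup row: AdamW, linear warmup from the warmup learning rate to the initial learning rate, cosine decay to the minimum learning rate, and LoRA fine-tuning in stage 2. The learning rates and warmup steps come from the quoted excerpt; the backbone checkpoint, total step count, and LoRA rank/alpha/target modules are illustrative assumptions, since the paper's reported setup does not include them.

```python
import math

from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stage-1 learning-rate values reported in the paper.
INIT_LR, MIN_LR, WARMUP_LR = 1e-4, 1e-5, 1e-6
WARMUP_STEPS = 1_000
TOTAL_STEPS = 50_000  # assumption: the excerpt does not state the total number of steps

# Assumption: a LLaMA2 backbone; the exact checkpoint id is not given in the excerpt.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = AdamW(model.parameters(), lr=INIT_LR)

def lr_factor(step: int) -> float:
    """Linear warmup WARMUP_LR -> INIT_LR, then cosine decay INIT_LR -> MIN_LR."""
    if step < WARMUP_STEPS:
        lr = WARMUP_LR + (INIT_LR - WARMUP_LR) * step / WARMUP_STEPS
    else:
        progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
        lr = MIN_LR + 0.5 * (INIT_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
    return lr / INIT_LR  # LambdaLR scales the optimizer's base LR by this factor

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)
# During training: call optimizer.step() then scheduler.step() every iteration.

# Stage 2: instruction tuning with LoRA at the reported lower learning rate (5e-5).
# Rank, alpha, and target modules below are assumptions for illustration only.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
optimizer = AdamW(model.parameters(), lr=5e-5)
```

A custom `LambdaLR` is used here because the stock cosine schedule with warmup in `transformers` warms up from zero and decays to zero, whereas the quoted setup specifies a non-zero warmup learning rate and a non-zero minimum learning rate.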