Structured Chemistry Reasoning with Large Language Models
Authors: Siru Ouyang, Zhuosheng Zhang, Bing Yan, Xuan Liu, Yejin Choi, Jiawei Han, Lianhui Qin
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Tested across four chemistry areas (quantum chemistry, mechanics, physical chemistry, and kinetics), STRUCTCHEM substantially enhances GPT-4's performance, with up to 30% peak improvement. Our analysis also underscores the unique difficulties of precise grounded reasoning in science with LLMs, highlighting a need for more research in this area. |
| Researcher Affiliation | Collaboration | (1) University of Illinois Urbana-Champaign, (2) Shanghai Jiao Tong University, (3) New York University, (4) University of Washington, (5) Allen Institute for AI, (6) University of California San Diego. |
| Pseudocode | Yes | Algorithm 1 Confidence-based Review-and-Refinement |
| Open Source Code | Yes | Code is available at https://github.com/ozyyshr/StructChem. |
| Open Datasets | Yes | In our experiments, we use four datasets taken from SciBench (Wang et al., 2023a). The datasets cover a wide range of subfields including quantum chemistry, physical chemistry, kinetics, and matter. |
| Dataset Splits | No | No explicit mention of training/validation/test dataset splits with percentages or counts for the general experimental setup. The paper mentions using a 'test set' for evaluation and Ps for few-shot demonstrations, but no dedicated validation split. Specifically, "Each of the datasets is divided into two parts, Pw and Ps. Here Pw contains the majority of the problems, which come without solutions. Meanwhile, problems in Ps are coupled with solutions." |
| Hardware Specification | Yes | We train the models with 10 epochs and it takes around 1 hour to train on a single NVIDIA A6000 GPU. |
| Software Dependencies | No | The paper mentions using LLaMA-2-13B-chat (Touvron et al., 2023) and Vicuna-13B-v1.3 (Chiang et al., 2023) as backbone models and finetuning with the LoRA approach (Hu et al., 2022). However, it does not provide specific version numbers for general software dependencies like Python or deep learning frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | During training, we configure the batch size to 8 and the maximum learning rate to 1e-4 with a 0.03 warmup ratio. For all the experiments, the LoRA r is set to 8, and we apply a dropout rate of 0.05. We train the models with 10 epochs... During the inference process, we also adhere to the same set of parameters: a temperature of 0.1, top-p of 0.75, top-k of 40, 4 beams, and a maximum generation length of 2,048. |
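
The Experiment Setup row reports enough hyperparameters to reconstruct a plausible fine-tuning and inference configuration. The sketch below maps those reported values onto Hugging Face `peft`/`transformers` config objects; the choice of these particular classes, the `do_sample=True` flag, and reading the 2,048-token limit as `max_new_tokens` are assumptions, not details stated in the paper.

```python
# Hypothetical mapping of the reported hyperparameters onto Hugging Face
# config objects; the paper gives the values but not these exact classes.
from peft import LoraConfig
from transformers import TrainingArguments, GenerationConfig

# LoRA settings reported in the paper: r = 8, dropout = 0.05.
lora_config = LoraConfig(
    r=8,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Training settings: batch size 8, max LR 1e-4, warmup ratio 0.03, 10 epochs.
training_args = TrainingArguments(
    output_dir="structchem-lora",      # placeholder output path
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    warmup_ratio=0.03,
    num_train_epochs=10,
)

# Inference settings: temperature 0.1, top-p 0.75, top-k 40, 4 beams,
# maximum generation length 2,048 (interpreted here as max_new_tokens).
generation_config = GenerationConfig(
    do_sample=True,                    # assumed, since sampling params are given
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=2048,
)
```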