Structured Chemistry Reasoning with Large Language Models

Authors: Siru Ouyang, Zhuosheng Zhang, Bing Yan, Xuan Liu, Yejin Choi, Jiawei Han, Lianhui Qin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Testing across four chemistry areas (quantum chemistry, mechanics, physical chemistry, and kinetics), STRUCTCHEM substantially enhances GPT-4's performance, with up to 30% peak improvement. Our analysis also underscores the unique difficulties of precise grounded reasoning in science with LLMs, highlighting a need for more research in this area.
Researcher Affiliation | Collaboration | (1) University of Illinois Urbana-Champaign, (2) Shanghai Jiao Tong University, (3) New York University, (4) University of Washington, (5) Allen Institute for AI, (6) University of California San Diego.
Pseudocode | Yes | Algorithm 1: Confidence-based Review-and-Refinement (an illustrative sketch of such a loop is given after the table).
Open Source Code | Yes | Code is available at https://github.com/ozyyshr/StructChem.
Open Datasets | Yes | In our experiments, we use four datasets taken from SciBench (Wang et al., 2023a). The datasets cover a wide range of subfields including quantum chemistry, physical chemistry, kinetics, and matter.
Dataset Splits | No | No explicit mention of training/validation/test dataset splits with percentages or counts for the general experimental setup. The paper mentions using a 'test set' for evaluation and 'Ps' for few-shot demonstrations, but no dedicated validation split. Specifically, "Each of the datasets is divided into two parts, Pw and Ps. Here Pw contains the majority of the problems, which come without solutions. Meanwhile, problems in Ps are coupled with solutions." (A toy illustration of this partition is given after the table.)
Hardware Specification | Yes | We train the models with 10 epochs and it takes around 1 hour to train on a single NVIDIA A6000 GPU.
Software Dependencies | No | The paper mentions using LLaMA-2-13B-chat (Touvron et al., 2023) and Vicuna-13B-v1.3 (Chiang et al., 2023) as backbone models and fine-tuning with the LoRA approach (Hu et al., 2022). However, it does not provide specific version numbers for general software dependencies such as Python or deep learning frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | During training, we configure the batch size to 8 and the maximum learning rate to 1e-4 with a 0.03 warmup ratio. For all the experiments, the LoRA r is set to 8, and we apply a dropout rate of 0.05. We train the models with 10 epochs... During the inference process, we also adhere to the same set of parameters: a temperature of 0.1, top-p of 0.75, top-k of 40, 4 beams, and a maximum generation length of 2,048. (These values are collected into a hedged configuration sketch after the table.)
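
The Pseudocode row refers to Algorithm 1 (Confidence-based Review-and-Refinement). The Python sketch below only illustrates what such a loop can look like; the helpers `call_llm` and `extract_confidence`, the prompts, and the stopping rule are placeholders and assumptions, not the procedure from the StructChem repository.

```python
# Hypothetical sketch of a confidence-based review-and-refinement loop.
# call_llm and extract_confidence are placeholders for an LLM API call
# and for parsing a self-reported confidence score from its output.

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., to GPT-4)."""
    raise NotImplementedError

def extract_confidence(review: str) -> float:
    """Placeholder: parse a confidence score in [0, 1] from the review text."""
    raise NotImplementedError

def review_and_refine(problem: str, draft: str,
                      max_iters: int = 3, threshold: float = 0.8) -> str:
    """Repeatedly ask the model to review a draft (e.g., generated formulae or
    reasoning steps) and refine it until the self-reported confidence is high
    enough or the iteration budget runs out."""
    current = draft
    for _ in range(max_iters):
        review = call_llm(
            f"Problem:\n{problem}\n\nDraft solution:\n{current}\n\n"
            "Review the draft, point out any errors, and give a confidence "
            "score between 0 and 1."
        )
        if extract_confidence(review) >= threshold:
            break  # accept the current draft
        current = call_llm(
            f"Problem:\n{problem}\n\nDraft:\n{current}\n\n"
            f"Reviewer feedback:\n{review}\n\nProduce a refined solution."
        )
    return current
```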
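
The Pw/Ps division quoted under Dataset Splits amounts to partitioning each dataset by whether a problem carries a worked solution (Ps, used for few-shot demonstrations) or not (Pw, used for evaluation). A toy sketch, with assumed field names:

```python
# Toy illustration of the Pw / Ps partition; the "solution" field name is assumed.

def split_problems(problems):
    """Partition a list of problem dicts into (Pw, Ps)."""
    ps = [p for p in problems if p.get("solution")]      # with worked solutions
    pw = [p for p in problems if not p.get("solution")]  # without solutions
    return pw, ps

# Example usage with a toy dataset:
problems = [
    {"question": "Compute the ground-state energy ...", "solution": "..."},
    {"question": "Estimate the rate constant ...", "solution": None},
]
pw, ps = split_problems(problems)
```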
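
The hyperparameters quoted under Experiment Setup can be gathered into configuration objects. The sketch below uses Hugging Face peft and transformers for concreteness; the paper does not name its training framework, and lora_alpha is an assumed value, so this is a plausible reconstruction rather than the authors' exact setup.

```python
# Reported hyperparameters expressed as peft/transformers config objects.
# Library choice, output_dir, and lora_alpha are assumptions for illustration.

from peft import LoraConfig
from transformers import TrainingArguments, GenerationConfig

lora_config = LoraConfig(
    r=8,                 # "the LoRA r is set to 8"
    lora_dropout=0.05,   # "a dropout rate of 0.05"
    lora_alpha=16,       # not reported in the paper; assumed value
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="structchem-lora",    # placeholder path
    per_device_train_batch_size=8,   # "batch size to 8"
    learning_rate=1e-4,              # "maximum learning rate to 1e-4"
    warmup_ratio=0.03,               # "0.03 warmup ratio"
    num_train_epochs=10,             # "10 epochs"
)

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=2048,  # "maximum generation length of 2,048"
)
```

The paper reports both beam search (4 beams) and sampling parameters (temperature, top-p, top-k); the configuration above simply records all of them as stated.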