GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning
Authors: Haiteng Zhao, Shengchao Liu, Chang Ma, Hannan Xu, Jie Fu, Zhi-Hong Deng, Lingpeng Kong, Qi Liu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that GIMLET significantly outperforms molecule-text baselines in instruction-based zero-shot learning, even achieving results close to supervised GNN models on tasks such as ToxCast and MUV. In the experiments, we investigate the following inquiries: (i) Can GIMLET effectively handle zero-shot molecule property tasks by instructions? (ii) Can GIMLET perform better with few-shot learning? (iii) What impact does model architecture have on the performance of GIMLET? (iv) How does pretraining affect the performance of GIMLET? (v) How does the form of instruction influence GIMLET for molecule zero-shot learning? |
| Researcher Affiliation | Academia | Haiteng Zhao (1), Shengchao Liu (2), Chang Ma (3), Hannan Xu (4), Jie Fu (5), Zhi-Hong Deng (1), Lingpeng Kong (3), Qi Liu (3); (1) Peking University, (2) Mila, (3) The University of Hong Kong, (4) University of Oxford, (5) Hong Kong University of Science and Technology |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the model architecture and mathematical formulations but does not present a step-by-step procedure in pseudocode form. |
| Open Source Code | Yes | The code, model, and data are available at https://github.com/zhao-ht/GIMLET. |
| Open Datasets | Yes | To this end, we select ChEMBL [20] as the pretraining dataset, which is widely used for supervised graph pretraining [26, 62]... First, we include large-scale datasets PCBA [72]... We also target tasks from MoleculeNet [75], a popular benchmark for molecule property prediction... We construct a dataset consisting of more than two thousand molecule tasks with corresponding instructions derived from task descriptions. We pretrain GIMLET on the molecule tasks along with instructions, enabling the model to transfer effectively to a broad range of tasks. |
| Dataset Splits | Yes | Following the standard supervised setting in previous studies [26], we adopt the Scaffold split [51] with a ratio of 0.8, 0.1, 0.1 for all the datasets, and report results on the testing sets, ensuring the comparability of our results to previous works. We first split datasets into training, validation, and testing sets in the same way as in the zero-shot setting. Then K samples for each class are randomly sampled from the training set as the few-shot examples, where K is the few-shot number. (A sketch of this split and sampling procedure is given after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It does not mention any specific hardware setup. |
| Software Dependencies | No | The paper mentions using T5 [50] as the backbone language model, but it does not provide specific version numbers for T5 or any other key software components, libraries, or solvers required to replicate the experiments. |
| Experiment Setup | No | The paper mentions that 'The details of pretraining and downstream zero-shot testing are in Appendix.' and 'In both classification tasks and regression tasks, we fine-tune the last linear layer of all models using their respective modeling loss.' However, it does not explicitly provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or detailed system-level training configurations in the main text or appendix. |
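The Scaffold split and per-class few-shot sampling described in the Dataset Splits row follow standard practice; below is a minimal sketch of both, assuming RDKit is installed. The function names (`scaffold_split`, `sample_few_shot`) and the fill-by-descending-group-size convention are illustrative assumptions, not code from the GIMLET repository; see the linked GitHub repository for the authors' actual implementation.

```python
# Minimal sketch of a Bemis-Murcko scaffold split (0.8/0.1/0.1) and per-class
# few-shot sampling, assuming RDKit is installed. Function names are
# illustrative, not taken from the GIMLET repository.
import random
from collections import defaultdict

from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    scaffold_to_idx = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=True)
        scaffold_to_idx[scaffold].append(i)

    # Fill train, then valid, from the largest scaffold groups down; the
    # remaining ~10% of molecules form the test set. Whole scaffold groups are
    # assigned to a single split, so structurally similar molecules never
    # cross split boundaries.
    groups = sorted(scaffold_to_idx.values(), key=len, reverse=True)
    n = len(smiles_list)
    train_cut, valid_cut = frac_train * n, (frac_train + frac_valid) * n

    train_idx, valid_idx, test_idx = [], [], []
    for group in groups:
        if len(train_idx) + len(group) <= train_cut:
            train_idx.extend(group)
        elif len(train_idx) + len(valid_idx) + len(group) <= valid_cut:
            valid_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, valid_idx, test_idx


def sample_few_shot(train_idx, labels, k, seed=0):
    # Randomly draw K examples per class from the training split (few-shot setting).
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    return [i for idxs in by_class.values() for i in rng.sample(idxs, min(k, len(idxs)))]
```

Filling splits from the largest scaffold group downward is one common deterministic variant of the scaffold split; the paper cites [51] for the exact procedure it follows.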