Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

Authors: Dingshuo Chen, Zhixun Li, Yuyan Ni, Guibin Zhang, Ding Wang, Qiang Liu, Shu Wu, Jeffrey Yu, Liang Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on 4 downstream tasks, spanning different modalities, pretraining strategies, and task settings. Our method can surpass the full-dataset performance when up to 60%-70% of the data is pruned, which validates the effectiveness of our approach and unlocks a door to enhancing model generalization with fewer samples.
Researcher Affiliation | Academia | 1 New Laboratory of Pattern Recognition, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; 4 Academy of Mathematics and Systems Science, Chinese Academy of Sciences; 5 Tongji University
Pseudocode | Yes | We also provide the pseudo-code of MolPeg in Algorithm 1. Algorithm 1: Molecular Data Pruning for Enhanced Generalization (MolPeg)
Open Source Code | Yes | We provide our codes and README file in supplementary materials to ensure reproducibility.
Open Datasets | Yes | To comprehensively validate the effectiveness of our proposed MolPeg, we conduct experiments on four datasets, i.e., HIV [34], PCBA [37], MUV [38], and QM9 [39], covering four types of molecular tasks.
Dataset Splits | Yes | In classification tasks, the dataset is randomly split, with an 80%/10%/10% partition for training, validation and testing, respectively. In regression tasks, the QM9 dataset is divided into 110K molecules for training, 10K for validation, and another 10K for testing. (See the split sketch after the table.)
Hardware Specification | Yes | We conduct all experiments on a computer server with 8 NVIDIA GeForce RTX 3090 GPUs (with 24GB memory each) and 256 AMD EPYC 7742 CPUs.
Software Dependencies | Yes | All of the experiments are implemented in Python 3.7, with the following supporting libraries: PyTorch 1.10.2 [58], PyG 2.0.3 [59], RDKit 2022.03.1 [60].
Experiment Setup | Yes | The Adam optimizer [44] is employed for training with a batch size of 256. For classification tasks, the learning rate is set at 0.001 and we opt against using a scheduler. For regression tasks, we align with the original experimental settings of PaiNN and SchNet, setting the learning rate to 5 × 10^-4 and incorporating a cosine annealing scheduler. For 2D graphs, we utilize the Graph Isomorphism Network (GIN) [40] as the encoder. To ensure the generalizability of our research findings, we adopt the commonly recognized experimental settings proposed by Hu et al. [41], with 300 hidden units in each layer and a 50% dropout ratio. The number of layers is set to 5. (See the training-setup sketch after the table.)
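
For reference, the reported data splits map onto a few lines of PyTorch. The sketch below is illustrative only and is not the authors' released code: `dataset`, `qm9_dataset`, and the random seed are placeholder assumptions, and how the held-out QM9 test molecules are selected is not specified in the excerpt beyond their count.

```python
# Illustrative sketch of the reported splits (placeholder names, not the authors' code).
import torch
from torch.utils.data import Subset, random_split


def random_split_80_10_10(dataset, seed=0):
    """Random 80%/10%/10% train/valid/test split used for the classification datasets."""
    n = len(dataset)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    n_test = n - n_train - n_valid
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_valid, n_test], generator=generator)


def qm9_split(qm9_dataset, seed=0):
    """QM9 regression split: 110K train, 10K valid, 10K test.

    How the specific held-out molecules are chosen is an assumption here;
    a random permutation is used for illustration.
    """
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(qm9_dataset), generator=generator).tolist()
    train_idx = perm[:110_000]
    valid_idx = perm[110_000:120_000]
    test_idx = perm[120_000:130_000]
    return (Subset(qm9_dataset, train_idx),
            Subset(qm9_dataset, valid_idx),
            Subset(qm9_dataset, test_idx))
```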
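
Similarly, the reported training configuration can be summarized in a short PyTorch/PyG sketch. This is an assumption-laden illustration rather than the released implementation: it uses a plain node-feature GIN (the paper follows Hu et al.'s molecular GIN, which embeds atom and bond features), `in_dim=9` is an arbitrary placeholder, and the MolPeg pruning loop itself is omitted.

```python
# Minimal sketch of the reported training configuration (not the authors' released code).
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv, global_mean_pool


class GINEncoder(nn.Module):
    """5-layer GIN encoder with 300 hidden units and 50% dropout, as reported."""

    def __init__(self, in_dim, hidden_dim=300, num_layers=5, num_tasks=1, dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList()
        self.batch_norms = nn.ModuleList()
        for layer in range(num_layers):
            mlp = nn.Sequential(
                nn.Linear(in_dim if layer == 0 else hidden_dim, 2 * hidden_dim),
                nn.ReLU(),
                nn.Linear(2 * hidden_dim, hidden_dim),
            )
            self.convs.append(GINConv(mlp))
            self.batch_norms.append(nn.BatchNorm1d(hidden_dim))
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_dim, num_tasks)

    def forward(self, x, edge_index, batch):
        for conv, bn in zip(self.convs, self.batch_norms):
            x = self.dropout(torch.relu(bn(conv(x, edge_index))))
        # Mean-pool node embeddings into graph embeddings, then predict per-task outputs.
        return self.head(global_mean_pool(x, batch))


# Classification setting: Adam, lr = 1e-3, batch size 256, no LR scheduler.
model = GINEncoder(in_dim=9, num_tasks=1)  # in_dim=9 is an illustrative placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Regression setting (PaiNN/SchNet on QM9): lr = 5e-4 with cosine annealing.
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)  # num_epochs: placeholder
```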