Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization
Authors: Dingshuo Chen, Zhixun Li, Yuyan Ni, Guibin Zhang, Ding Wang, Qiang Liu, Shu Wu, Jeffrey Yu, Liang Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 4 downstream tasks, spanning different modalities, pretraining strategies, and task settings. Our method can surpass the full-dataset performance when up to 60%-70% of the data is pruned, which validates the effectiveness of our approach and unlocks a door to enhancing model generalization with fewer samples. |
| Researcher Affiliation | Academia | 1 New Laboratory of Pattern Recognition, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; 4 Academy of Mathematics and Systems Science, Chinese Academy of Sciences; 5 Tongji University |
| Pseudocode | Yes | We also provide the pseudo-code of MolPeg in Algorithm 1. Algorithm 1: Molecular Data Pruning for Enhanced Generalization (MolPeg) |
| Open Source Code | Yes | We provide our codes and README file in supplementary materials to ensure reproducibility. |
| Open Datasets | Yes | To comprehensively validate the effectiveness of our proposed MolPeg, we conduct experiments on four datasets, i.e., HIV [34], PCBA [37], MUV [38], and QM9 [39], covering four types of molecular tasks. |
| Dataset Splits | Yes | In classification tasks, the dataset is randomly split with an 80%/10%/10% partition for training, validation, and testing, respectively. In regression tasks, the QM9 dataset is divided into 110K molecules for training, 10K for validation, and another 10K for testing. A split sketch is given after the table. |
| Hardware Specification | Yes | We conduct all experiments on a computer server with 8 NVIDIA GeForce RTX 3090 GPUs (with 24GB memory each) and 256 AMD EPYC 7742 CPUs. |
| Software Dependencies | Yes | All of the experiments are implemented in Python 3.7, with the following supporting libraries: PyTorch 1.10.2 [58], PyG 2.0.3 [59], RDKit 2022.03.1 [60]. |
| Experiment Setup | Yes | The Adam optimizer [44] is employed for training with a batch size of 256. For classification tasks, the learning rate is set to 0.001 and no scheduler is used. For regression tasks, we align with the original experimental settings of PaiNN and SchNet, setting the learning rate to 5 × 10⁻⁴ and incorporating a cosine annealing scheduler. For 2D graphs, we utilize the Graph Isomorphism Network (GIN) [40] as the encoder. To ensure the generalizability of our research findings, we adopt the commonly recognized experimental settings proposed by Hu et al. [41], with 300 hidden units in each layer, a 50% dropout ratio, and 5 layers. A configuration sketch follows the table. |
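
To make the reported split protocol concrete, the sketch below reproduces the partitions as index splits: a random 80%/10%/10% split for the classification datasets and a fixed 110K/10K/10K split for QM9. The function names, the use of NumPy, and the fixed seed are illustrative assumptions, not taken from the paper or its released code.

```python
import numpy as np

def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Random 80%/10%/10% index split, as reported for the classification datasets."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    return perm[:n_train], perm[n_train:n_train + n_valid], perm[n_train + n_valid:]

def qm9_split(n_samples, n_train=110_000, n_valid=10_000, n_test=10_000, seed=0):
    """Fixed-size QM9 split: 110K train / 10K validation / 10K test."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    return (perm[:n_train],
            perm[n_train:n_train + n_valid],
            perm[n_train + n_valid:n_train + n_valid + n_test])

# Example usage with an illustrative (not dataset-specific) sample count.
train_idx, valid_idx, test_idx = random_split(n_samples=100_000)
```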
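The training configuration in the last row can likewise be sketched in code. The snippet below assumes PyTorch and PyG (torch_geometric); the `GIN` model class, the 300-dimensional input assumption, the single-task head, and the scheduler's `T_max` are illustrative choices layered on the reported hyperparameters (Adam, batch size 256, lr 1e-3 for classification, lr 5e-4 with cosine annealing for regression, 5 GIN layers, 300 hidden units, 50% dropout). It is not the authors' exact implementation.

```python
import torch
from torch_geometric.nn import GIN  # assumption: model class available in the installed PyG version

# 2D-graph encoder: 5-layer GIN, 300 hidden units, 50% dropout (Hu et al. settings).
encoder = GIN(
    in_channels=300,   # assumption: node features already embedded to 300 dims
    hidden_channels=300,
    num_layers=5,
    dropout=0.5,
)
head = torch.nn.Linear(300, 1)  # illustrative single-task prediction head
params = list(encoder.parameters()) + list(head.parameters())
batch_size = 256  # reported batch size for both task types

# Classification: Adam, lr = 1e-3, no learning-rate scheduler.
optimizer_cls = torch.optim.Adam(params, lr=1e-3)

# Regression (QM9 with PaiNN/SchNet encoders): Adam, lr = 5e-4, cosine annealing schedule.
optimizer_reg = torch.optim.Adam(params, lr=5e-4)
scheduler_reg = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer_reg, T_max=100)  # T_max is illustrative
```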