Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

Authors: Dingshuo Chen, Zhixun Li, Yuyan Ni, Guibin Zhang, Ding Wang, Qiang Liu, Shu Wu, Jeffrey Yu, Liang Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on 4 downstream tasks, spanning different modalities, pretraining strategies, and task settings. Our method can surpass the full-dataset performance when up to 60%-70% of the data is pruned, which validates the effectiveness of our approach and unlocks a door to enhancing model generalization with fewer samples.
Researcher Affiliation | Academia | 1 New Laboratory of Pattern Recognition, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; 4 Academy of Mathematics and Systems Science, Chinese Academy of Sciences; 5 Tongji University
Pseudocode | Yes | We also provide the pseudo-code of MolPeg in Algorithm 1. Algorithm 1: Molecular Data Pruning for Enhanced Generalization (MolPeg)
Open Source Code | Yes | We provide our codes and README file in supplementary materials to ensure reproducibility.
Open Datasets | Yes | To comprehensively validate the effectiveness of our proposed MolPeg, we conduct experiments on four datasets, i.e., HIV [34], PCBA [37], MUV [38], and QM9 [39], covering four types of molecular tasks.
Dataset Splits | Yes | In classification tasks, the dataset is randomly split, with an 80%/10%/10% partition for training, validation and testing, respectively. In regression tasks, the QM9 dataset is divided into 110K molecules for training, 10K for validation, and another 10K for testing. (See the split sketch after the table.)
Hardware Specification | Yes | We conduct all experiments on a computer server with 8 NVIDIA GeForce RTX 3090 GPUs (with 24GB memory each) and 256 AMD EPYC 7742 CPUs.
Software Dependencies | Yes | All of the experiments are implemented in Python 3.7, with the following supporting libraries: PyTorch 1.10.2 [58], PyG 2.0.3 [59], RDKit 2022.03.1 [60].
Experiment Setup | Yes | The Adam optimizer [44] is employed for training with a batch size of 256. For classification tasks, the learning rate is set at 0.001 and we opt against using a scheduler. For regression tasks, we align with the original experimental settings of PaiNN and SchNet, setting the learning rate to 5 × 10^-4 and incorporating a cosine annealing scheduler. For 2D graphs, we utilize the Graph Isomorphism Network (GIN) [40] as the encoder. To ensure the generalizability of our research findings, we adopt the commonly recognized experimental settings proposed by Hu et al. [41], with 300 hidden units in each layer and a 50% dropout ratio. The number of layers is set to 5. (See the training-setup sketch after the table.)
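
For reference, the reported data splits map onto a few lines of PyTorch. The sketch below is illustrative only and is not the authors' released code: `dataset`, `qm9_dataset`, and the random seed are placeholder assumptions, and how the held-out QM9 test molecules are selected is not specified in the excerpt beyond their count.

```python
# Illustrative sketch of the reported splits (placeholder names, not the authors' code).
import torch
from torch.utils.data import Subset, random_split


def random_split_80_10_10(dataset, seed=0):
    """Random 80%/10%/10% train/valid/test split used for the classification datasets."""
    n = len(dataset)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    n_test = n - n_train - n_valid
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_valid, n_test], generator=generator)


def qm9_split(qm9_dataset, seed=0):
    """QM9 regression split: 110K train, 10K valid, 10K test.

    How the specific held-out molecules are chosen is an assumption here;
    a random permutation is used for illustration.
    """
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(qm9_dataset), generator=generator).tolist()
    train_idx = perm[:110_000]
    valid_idx = perm[110_000:120_000]
    test_idx = perm[120_000:130_000]
    return (Subset(qm9_dataset, train_idx),
            Subset(qm9_dataset, valid_idx),
            Subset(qm9_dataset, test_idx))
```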
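
Similarly, the reported training configuration can be summarized in a short PyTorch/PyG sketch. This is an assumption-laden illustration rather than the released implementation: it uses a plain node-feature GIN (the paper follows Hu et al.'s molecular GIN, which embeds atom and bond features), `in_dim=9` is an arbitrary placeholder, and the MolPeg pruning loop itself is omitted.

```python
# Minimal sketch of the reported training configuration (not the authors' released code).
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv, global_mean_pool


class GINEncoder(nn.Module):
    """5-layer GIN encoder with 300 hidden units and 50% dropout, as reported."""

    def __init__(self, in_dim, hidden_dim=300, num_layers=5, num_tasks=1, dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList()
        self.batch_norms = nn.ModuleList()
        for layer in range(num_layers):
            mlp = nn.Sequential(
                nn.Linear(in_dim if layer == 0 else hidden_dim, 2 * hidden_dim),
                nn.ReLU(),
                nn.Linear(2 * hidden_dim, hidden_dim),
            )
            self.convs.append(GINConv(mlp))
            self.batch_norms.append(nn.BatchNorm1d(hidden_dim))
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_dim, num_tasks)

    def forward(self, x, edge_index, batch):
        for conv, bn in zip(self.convs, self.batch_norms):
            x = self.dropout(torch.relu(bn(conv(x, edge_index))))
        # Mean-pool node embeddings into graph embeddings, then predict per-task outputs.
        return self.head(global_mean_pool(x, batch))


# Classification setting: Adam, lr = 1e-3, batch size 256, no LR scheduler.
model = GINEncoder(in_dim=9, num_tasks=1)  # in_dim=9 is an illustrative placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Regression setting (PaiNN/SchNet on QM9): lr = 5e-4 with cosine annealing.
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)  # num_epochs: placeholder
```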