Extracting Training Data from Molecular Pre-trained Models

Authors: Renhong Huang, Jiarong Xu, Zhiming Yang, Xiang Si, Xin Jiang, Hanyang Yuan, Chunping Wang, Yang Yang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that even with only query access to molecular pre-trained models, there is a considerable risk of extracting training data, challenging the assumption that model sharing alone provides adequate protection against data extraction attacks." and, from Section 4 (Experiments): "In this section, we evaluate the performance of molecular extraction attacks against different molecular pre-trained models. Besides, we conduct case studies and runtime analyses to underscore the effectiveness of our approach."
Researcher Affiliation | Collaboration | Renhong Huang (1,2), Jiarong Xu (2), Zhiming Yang (2), Xiang Si (2), Xin Jiang (3), Hanyang Yuan (1), Chunping Wang (4), Yang Yang (1); 1 Zhejiang University, 2 Fudan University, 3 Lehigh University, 4 FinVolution Group
Pseudocode | Yes | Algorithm 1: "Algorithm of the proposed model"
Open Source Code | Yes | "Our codes are publicly available at: https://github.com/renH2/Molextract."
Open Datasets | Yes | "In our experiment, we used datasets containing 2 million molecules sampled from ZINC15 [46] as the pre-training dataset G, and an additional 20,000 molecules as the auxiliary dataset G_aux. This assumption is reasonable given that such an auxiliary dataset can be sourced from publicly available molecular databases like ChEMBL [14], PubChem [51], or ZINC15 [46], or it could be some data held by the adversaries themselves." (See the dataset-assembly sketch after the table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test splits for the main datasets used in the experiments (e.g., for the ZINC15 pre-training dataset, or for evaluating the attack's performance on subsets of the private data).
Hardware Specification | Yes | "All experiments are conducted on a single Linux machine with an Intel Xeon Gold 5118 (128 GB memory) and a GeForce GTX Tesla P4 (8 GB memory)."
Software Dependencies | No | The paper mentions the 'rdkit toolkit' and 'OpenAI's Spinning Up' but does not provide version numbers for these or other key software components, which are required for a reproducible description of ancillary software. (A version-recording snippet follows the table.)
Experiment Setup | Yes | "Regarding the parameters for the RL agent, we set the total number of training epochs to 100, with β_i as the weight for the delayed reward and δ = 0.05 for the intermediate rewards. ... We employ the Adam optimizer with a learning rate of 0.005 for 100 epochs during the pre-training phase. ... For policy training, we implement three policy networks with two-layer MLPs and a hidden size of 128. The graph representation network utilizes a two-layer GCN with a hidden size of 128. We update the policy network after generating 256 molecules, and set the temperature τ to 1. The policy networks are trained with the Adam optimizer, using a learning rate of 0.01 and a weight decay of 1e-6." (A configuration sketch follows the table.)
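
The Open Datasets row describes a 2-million-molecule pre-training set G and a disjoint 20,000-molecule auxiliary set G_aux, both drawn from ZINC15. As a rough illustration only, the sketch below shows how such sets could be assembled from a ZINC15 SMILES export; the file name, seed, and RDKit validity filter are assumptions, not the authors' actual pipeline.

```python
# Hypothetical assembly of the pre-training set G and auxiliary set G_aux
# from a ZINC15 SMILES export. File name and seed are illustrative only.
import random
from rdkit import Chem

def load_valid_smiles(path):
    """Read one SMILES string per line, keeping only RDKit-parsable molecules."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return [s for s in lines if Chem.MolFromSmiles(s) is not None]

random.seed(0)                                   # fixed seed for a repeatable draw
pool = load_valid_smiles("zinc15_smiles.txt")    # assumed ZINC15 export
draw = random.sample(pool, 2_000_000 + 20_000)   # sample without replacement
pretrain_G = draw[:2_000_000]                    # pre-training dataset G
aux_G = draw[2_000_000:]                         # auxiliary dataset G_aux, disjoint from G
```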
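On the Software Dependencies gap: because the paper names RDKit and OpenAI's Spinning Up without versions, anyone reproducing the setup must pin and record versions themselves. A minimal way to log the ancillary software in use (assuming a PyTorch-based stack, which the GCN/Adam setup suggests but this excerpt does not confirm):

```python
# Record interpreter and library versions for a reproducibility log.
# PyTorch is an assumption; the paper does not name its deep-learning framework here.
import sys
import rdkit
import torch

print("python:", sys.version.split()[0])
print("rdkit :", rdkit.__version__)
print("torch :", torch.__version__)
```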
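The Experiment Setup row fully specifies the policy-training configuration. Below is a minimal PyTorch sketch of that configuration, not the authors' implementation: class names, input dimensions, and per-head action counts are assumptions (the repository linked above is authoritative), and PyTorch Geometric's GCNConv stands in for the two-layer GCN encoder.

```python
# Sketch of the reported setup: three two-layer MLP policy networks
# (hidden size 128), a two-layer GCN encoder (hidden size 128),
# softmax temperature tau = 1, Adam with lr=0.01 and weight_decay=1e-6,
# and policy updates after every 256 generated molecules.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv  # assumption: PyG supplies the GCN

class GraphEncoder(nn.Module):
    """Two-layer GCN producing 128-dimensional node representations."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

class PolicyHead(nn.Module):
    """Two-layer MLP policy network; the action-space size is an assumption."""
    def __init__(self, in_dim, n_actions, hidden=128, tau=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )
        self.tau = tau  # temperature tau = 1 in the reported setup

    def forward(self, h):
        return torch.softmax(self.net(h) / self.tau, dim=-1)

encoder = GraphEncoder(in_dim=39)  # node-feature dimension assumed
policies = nn.ModuleList(PolicyHead(128, n) for n in (40, 40, 4))  # action sizes assumed
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(policies.parameters()),
    lr=0.01, weight_decay=1e-6,
)
UPDATE_EVERY = 256  # update the policy networks after generating 256 molecules
```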