Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Holistic Molecular Representation Learning via Multi-view Fragmentation
Authors: Seojin Kim, Jaehyun Nam, Junsu Kim, Hankook Lee, Sungsoo Ahn, Jinwoo Shin
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate the superiority of our Holi-Mol framework over existing pretraining methods. Specifically, our GNN pretrained by Holi-Mol on the GEOM (Axelrod & Gomez-Bombarelli, 2022) dataset consistently outperforms the state-of-the-art method, GraphMVP (Liu et al., 2022b), when transferred to both MoleculeNet classification (Wu et al., 2018) and QM9 regression (Ramakrishnan et al., 2014) benchmarks (see Tables 1 and 2, respectively). For example, we improve the average ROC-AUC score from 74.1 to 75.5 over the prior art on MoleculeNet. We further demonstrate the potential of Holi-Mol for other applications: semi-supervised/fully-supervised learning (see Table 3) and molecule retrieval (see Table 4). ... In this section, we extensively compare Holi-Mol with existing molecular graph representation learning methods. We evaluate Holi-Mol and the baselines on various downstream molecular property prediction tasks after pretraining on an (unlabeled) molecular dataset. |
| Researcher Affiliation | Academia | Seojin Kim*, Korea Advanced Institute of Science & Technology (KAIST); Jaehyun Nam*, Korea Advanced Institute of Science & Technology (KAIST); Junsu Kim, Korea Advanced Institute of Science & Technology (KAIST); Hankook Lee, Sungkyunkwan University (SKKU); Sungsoo Ahn, Pohang University of Science and Technology (POSTECH); Jinwoo Shin, Korea Advanced Institute of Science & Technology (KAIST) |
| Pseudocode | No | The paper provides formal equations for GNN and SchNet architectures in Section C, but there are no explicit blocks labeled 'Pseudocode' or 'Algorithm' for the Holi-Mol framework itself. |
| Open Source Code | No | Our code is based on the open-source code of GraphMVP. https://github.com/chao1224/GraphMVP |
| Open Datasets | Yes | For pretraining, we consider the GEOM (Axelrod & Gomez-Bombarelli, 2022) and the QM9 (Ramakrishnan et al., 2014) datasets, which consist of 2D and 3D paired molecular graphs. We consider (a) transfer learning on the binary classification tasks from the MoleculeNet benchmark (Wu et al., 2018), and (b) transfer learning and semi-supervised learning on the regression tasks using QM9 (Ramakrishnan et al., 2014). |
| Dataset Splits | Yes | For MoleculeNet experiments, the split is based on molecular substructures. ... We use the split ratio train:validation:test = 80:10:10 for each downstream task dataset to evaluate the performance. ... For QM9 experiments, we follow the setup of Liu et al. (2021), which splits the dataset into 110,000 molecules for training, 10,000 molecules for validation, and 10,831 molecules for test. |
| Hardware Specification | Yes | We use a single NVIDIA GeForce RTX 3090 GPU with 36 CPU cores (Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz) for self-supervised pretraining, and a single NVIDIA GeForce RTX 2080 Ti GPU with 40 CPU cores (Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz) for fine-tuning. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2014) and refers to GIN (Xu et al., 2019) and SchNet (Schütt et al., 2017) as architectures, but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Specifically, we use a batch size of 256 and no weight decay. We use {Nodedrop, Attrmask, identity} randomly, i.e., 1/3 probability for each fragment and the original 2D molecular graphs, and Gaussian noise N(0, I) on each coordinate of 3D molecular graphs. When Nodedrop or Attrmask is used, we drop/mask 10% of the total vertices. For self-supervised pretraining, we train for 100 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and no dropout. ... we fine-tune a pretrained 2D-GNN with an initialized linear layer for 100 epochs with the Adam optimizer, a learning rate of 0.001, and a dropout probability of 0.5. ... We fine-tune a pretrained 2D-GNN with an initialized 2-layer multi-layer perceptron for 1,000 epochs with the Adam optimizer and a StepLR scheduler with decay ratio of 0.5 and initial learning rate of 5e-4. |
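The 80:10:10 train/validation/test ratio reported under Dataset Splits can be sketched as follows. This is a minimal random-split illustration; the paper's actual MoleculeNet split groups molecules by substructure (a scaffold-style split), which requires cheminformatics tooling not shown here. The function name and seed are illustrative, not from the paper.

```python
import random

def split_indices(n, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle n sample indices and split them into train/validation/test
    subsets according to `ratios` (here the paper's 80:10:10)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

# For a hypothetical 1,000-molecule downstream dataset:
train, valid, test = split_indices(1000)  # 800 / 100 / 100 indices
```

The three subsets are disjoint and cover all indices, so no molecule leaks between splits.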
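The QM9 fine-tuning schedule quoted above (StepLR, decay ratio 0.5, initial learning rate 5e-4) can be sketched in plain Python. Note the quoted setup does not state the scheduler's step size, so the value used below is purely an assumption for illustration.

```python
def step_lr(initial_lr, decay_ratio, step_size, epoch):
    """Learning rate at a given epoch under a StepLR schedule:
    the rate is multiplied by `decay_ratio` every `step_size` epochs."""
    return initial_lr * decay_ratio ** (epoch // step_size)

# Paper's values: initial lr 5e-4, decay ratio 0.5; step_size=300 is an
# assumed placeholder, since the excerpt does not report it.
lr_start = step_lr(5e-4, 0.5, 300, 0)      # 5e-4
lr_later = step_lr(5e-4, 0.5, 300, 300)    # halved to 2.5e-4
```

In a PyTorch training loop this would correspond to `torch.optim.lr_scheduler.StepLR` stepped once per epoch alongside the Adam optimizer.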