Data-Centric Learning from Unlabeled Graphs with Diffusion Model
Authors: Gang Liu, Eric Inae, Tong Zhao, Jiaxin Xu, Tengfei Luo, Meng Jiang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that the data-centric approach performs significantly better than fifteen existing methods on fifteen tasks. Unlike self-supervised learning, the performance improvement brought by unlabeled data is visible in the generated labeled examples. |
| Researcher Affiliation | Collaboration | Gang Liu (University of Notre Dame, gliu7@nd.edu); Eric Inae (University of Notre Dame, einae@nd.edu); Tong Zhao (Snap Inc., tzhao@snap.com); Jiaxin Xu (University of Notre Dame, jxu24@nd.edu); Tengfei Luo (University of Notre Dame, tluo@nd.edu); Meng Jiang (University of Notre Dame, mjiang2@nd.edu) |
| Pseudocode | Yes | Algorithm 1: Diffusion-Based Graph Augmentation with PC Sampling; Algorithm 2: The Data-Centric Knowledge Transfer Framework: Learning from Unlabeled Graphs; Algorithm 3: The Data-Centric Knowledge Transfer Framework: Generating Task-Specific Labeled Graphs (a hedged schematic of the perturb-and-denoise loop follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Experiments are conducted on 15 graph property prediction tasks in chemistry, materials science, and biology: seven molecule classification tasks and three molecule regression tasks from the open graph benchmark (Hu et al., 2020), four polymer regression tasks, and protein function prediction (PPI) (Hu et al., 2019). For semi-supervised learning methods and DCT, the 113K-graph QM9 dataset (Ramakrishnan et al., 2014) and 306K PPI graphs (Hu et al., 2019) serve as unlabeled data sources for the molecule/polymer tasks and the protein tasks, respectively. |
| Dataset Splits | Yes | For all molecule datasets, the scaffold splitting procedure adopted by the open graph benchmark is used (Hu et al., 2020). For all polymer tasks, the data is randomly split 60%/10%/30% into training, validation, and test sets (a minimal split sketch follows the table). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components like GIN (Graph Isomorphism Network) but does not provide specific version numbers for any key software dependencies or libraries. |
| Experiment Setup | Yes | For DCT, three major hyper-parameters are tuned: the number of perturbation steps D ∈ [1, 10], the number of negative samples M ∈ [1, 10], and the top-n% of labeled graphs with the lowest property-prediction loss selected for data augmentation. Results in Figure 4 show that DCT is robust across the [1, 10] range for both D and M, and suggest setting both to 5 in most cases. As for the number of augmented graphs per iteration, results show that noisy graphs are often created when n exceeds 30%, because the predictor cannot effectively guide augmentation for labeled graphs whose labels are hard to predict; 10% is therefore suggested as the default for top-n% (a hedged selection sketch follows the table). |
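
The Pseudocode row names a diffusion-based augmentation with predictor-corrector (PC) sampling (Algorithm 1). The rough schematic below shows only the high-level shape that the algorithm title implies: perturb a labeled graph for D forward diffusion steps, then denoise it with guided PC steps. `forward_noise`, `pc_step`, and `property_grad` are hypothetical placeholders, not the authors' API, and the paper's exact update rules are not reproduced here.

```python
# Rough schematic of a perturb-and-denoise augmentation loop, under assumed
# interfaces: forward_noise, pc_step, and property_grad are hypothetical
# placeholders standing in for the paper's diffusion model and property
# predictor. D is the number of perturbation steps tuned in [1, 10].
def augment(graph, D, forward_noise, pc_step, property_grad):
    x = graph
    for t in range(D):                 # forward diffusion: perturb D steps
        x = forward_noise(x, t)
    for t in reversed(range(D)):       # reverse pass: predictor-corrector
        x = pc_step(x, t, guidance=property_grad(x))  # guided by predictor
    return x
```

The Dataset Splits row quotes a 60%/10%/30% random split for the polymer tasks. Below is a minimal sketch of such a split, assuming an in-memory list of graphs; `graphs` and `seed` are illustrative names. Note that the molecule tasks instead use OGB's scaffold split, which groups molecules by scaffold rather than at random.

```python
# Minimal sketch of the 60%/10%/30% random split described for the polymer
# tasks (illustrative only; molecule tasks use OGB's scaffold split).
import numpy as np

def random_split(graphs, seed=0, frac_train=0.6, frac_valid=0.1):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(graphs))          # shuffle indices once
    n_train = int(frac_train * len(graphs))
    n_valid = int(frac_valid * len(graphs))
    train = [graphs[i] for i in idx[:n_train]]
    valid = [graphs[i] for i in idx[n_train:n_train + n_valid]]
    test = [graphs[i] for i in idx[n_train + n_valid:]]  # remaining ~30%
    return train, valid, test
```

The Experiment Setup row describes selecting the top-n% of labeled graphs with the lowest property-prediction loss to seed augmentation, with 10% as the suggested default. A hedged sketch of that selection step is below; `graphs` and `losses` are assumed to be parallel sequences, and this is not the authors' code.

```python
# Hedged sketch of the top-n% selection from the setup: keep the labeled
# graphs whose property-prediction loss is lowest, so the predictor can
# reliably guide augmentation. n_percent=10 matches the suggested default.
def select_top_n_percent(graphs, losses, n_percent=10):
    ranked = sorted(zip(graphs, losses), key=lambda pair: pair[1])
    k = max(1, int(len(ranked) * n_percent / 100))
    return [g for g, _ in ranked[:k]]
```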
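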
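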