DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion
Authors: Qitian Wu, Chenxiao Yang, Wentao Zhao, Yixuan He, David Wipf, Junchi Yan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as node classification on large graphs, semi-supervised image/text classification, and spatial-temporal dynamics prediction. |
| Researcher Affiliation | Collaboration | Qitian Wu, Chenxiao Yang, Wentao Zhao, Yixuan He, David Wipf, Junchi Yan. Department of CSE & MoE Lab of Artificial Intelligence, Shanghai Jiao Tong University; Department of Statistics, University of Oxford; Amazon Web Service. {echo740,chr26195,permanent,yanjunchi}@sjtu.edu.cn, yixuan.he@stats.ox.ac.uk, davidwipf@gmail.com |
| Pseudocode | Yes | We provide the PyTorch-style pseudo-code for the DIFFORMER class in Alg. 1 and the one-layer propagation of the two model versions (shown in Alg. 2 for DIFFORMER-s and Alg. 3 for DIFFORMER-a). The key design of our methodology lies in the model architectures, which are shown in detail in Alg. 2 for DIFFORMER-s and Alg. 3 for DIFFORMER-a; in each case, the model takes the data as input and outputs a prediction for each individual instance. (A hedged sketch of the DIFFORMER-s propagation appears after the table.) |
| Open Source Code | Yes | The codes are available at https://github.com/qitianwu/DIFFormer. |
| Open Datasets | Yes | We use all 13,000 images from STL-10, each of which belongs to one of ten classes. We choose 1500 images from each of 10 classes of CIFAR-10 and obtain a total of 15,000 images. ... We also evaluate our model on 20Newsgroup, which is a text classification dataset consisting of 9607 instances. ... The spatial-temporal datasets are from the open-source library PyTorch Geometric Temporal (Rozemberczki et al., 2021)... |
| Dataset Splits | Yes | Following the semi-supervised learning setting in Kipf & Welling (2017), we randomly choose 20 instances per class for training, and 500/1000 instances for validation/testing for each dataset. ... For STL-10 and CIFAR-10, we randomly select 10/50/100 instances per class as the training set, 1000 instances for validation, and the remaining instances for testing. ... For each dataset, we split the snapshots into training, validation, and test sets according to a 2:2:6 ratio in order to make it more challenging and closer to the real-world low-data learning setting. (A sketch of this split appears after the table.) |
| Hardware Specification | Yes | In particular, we found GCN/GAT/DIFFORMER-s are still hard for full-graph training on a single V100 GPU with 16GB memory. |
| Software Dependencies | No | The paper provides 'PyTorch-style pseudo-code' in Appendix E.1, implying the use of PyTorch, but it does not specify exact version numbers for PyTorch or any other software libraries or dependencies. |
| Experiment Setup | Yes | For other hyper-parameters, we adopt grid search for all the models, with the learning rate from {0.0001, 0.001, 0.01, 0.1}, weight decay for the Adam optimizer from {0, 0.0001, 0.001, 0.01, 0.1, 1.0}, dropout rate from {0, 0.2, 0.5}, hidden size from {16, 32, 64}, and number of layers from {2, 4, 8, 16}. For evaluation, we compute the mean and standard deviation of the results over five repeated runs with different initializations. For each run, we train for a maximum of 1000 epochs and report the test performance achieved at the epoch yielding the best performance on the validation set. (A sketch of this protocol appears after the table.) |
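
As a reading aid for the Pseudocode entry, the following is a minimal PyTorch sketch of a DIFFORMER-s-style one-layer propagation: all-pair attention with the non-negative kernel 1 + q_i·k_j over L2-normalized queries and keys, computed in linear time by reordering the sums, followed by a residual mix with the initial state. The layer names (`Wq`, `Wk`) and the `alpha` weighting are illustrative assumptions; the authors' exact algorithms are Alg. 1-3 in the paper and the released code.

```python
# Hedged sketch of a DIFFormer-s-style layer; NOT the authors' implementation
# (see https://github.com/qitianwu/DIFFormer for the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDiffusionLayer(nn.Module):
    def __init__(self, hidden_dim: int, alpha: float = 0.5):
        super().__init__()
        self.Wq = nn.Linear(hidden_dim, hidden_dim)  # query map (assumed)
        self.Wk = nn.Linear(hidden_dim, hidden_dim)  # key map (assumed)
        self.alpha = alpha                           # weight on the initial state (assumed)

    def forward(self, z: torch.Tensor, z0: torch.Tensor) -> torch.Tensor:
        # z, z0: [N, d] current and initial node states
        q = F.normalize(self.Wq(z), p=2, dim=-1)     # [N, d]
        k = F.normalize(self.Wk(z), p=2, dim=-1)     # [N, d]
        n = z.shape[0]
        # Attention weight a_ij proportional to 1 + q_i·k_j, normalized over j.
        # Reordering the sums avoids materializing the N x N attention matrix
        # (O(N d^2) instead of O(N^2 d)).
        kv = k.t() @ z                                          # [d, d] = sum_j k_j z_j^T
        numerator = z.sum(dim=0, keepdim=True) + q @ kv         # [N, d]
        denominator = n + q @ k.sum(dim=0, keepdim=True).t()    # [N, 1]
        propagated = numerator / denominator
        # Convex combination with the initial state (source term); the exact
        # update rule in the paper may differ.
        return (1 - self.alpha) * propagated + self.alpha * z0
```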
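
The semi-supervised split quoted in the Dataset Splits row (20 labeled instances per class for training, 500 for validation, 1000 for testing) can be reproduced roughly as follows; the function name and index handling are illustrative assumptions, not taken from the released code.

```python
# Hedged sketch of the 20-per-class / 500 / 1000 split described in the paper.
import torch

def semi_supervised_split(labels: torch.Tensor, num_classes: int,
                          per_class: int = 20, num_val: int = 500, num_test: int = 1000):
    perm = torch.randperm(labels.shape[0])           # shuffle all instance indices
    train_parts = []
    for c in range(num_classes):
        cls_idx = perm[labels[perm] == c]            # shuffled indices of class c
        train_parts.append(cls_idx[:per_class])      # first `per_class` go to training
    train_idx = torch.cat(train_parts)
    remaining = perm[~torch.isin(perm, train_idx)]   # everything not used for training
    val_idx = remaining[:num_val]
    test_idx = remaining[num_val:num_val + num_test]
    return train_idx, val_idx, test_idx
```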
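
The grid search and evaluation protocol quoted in the Experiment Setup row could be scripted roughly as below. `train_and_eval` is a hypothetical helper that trains one configuration for up to 1000 epochs and returns the best validation score together with the test score at that epoch; the two-stage layout (model selection with a fixed seed, then five repeated runs) is one plausible reading of the quoted text.

```python
# Hedged sketch of the hyper-parameter grid and 5-run evaluation protocol.
import itertools
import statistics

grid = {
    "lr": [0.0001, 0.001, 0.01, 0.1],
    "weight_decay": [0, 0.0001, 0.001, 0.01, 0.1, 1.0],
    "dropout": [0, 0.2, 0.5],
    "hidden_size": [16, 32, 64],
    "num_layers": [2, 4, 8, 16],
}

best_cfg, best_val = None, float("-inf")
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    val_score, _ = train_and_eval(cfg, seed=0, max_epochs=1000)  # hypothetical helper
    if val_score > best_val:
        best_cfg, best_val = cfg, val_score

# Five repeated runs with different initializations for the selected configuration.
test_scores = [train_and_eval(best_cfg, seed=s, max_epochs=1000)[1] for s in range(5)]
print(f"test: {statistics.mean(test_scores):.2f} +/- {statistics.stdev(test_scores):.2f}")
```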