Transformers Meet Directed Graphs
Authors: Simon Geisler, Yujia Li, Daniel J Mankowitz, Ali Taylan Cemgil, Stephan Günnemann, Cosmin Paduraru
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that the extra directionality information is useful in various downstream tasks, including correctness testing of sorting networks and source code understanding. Together with a data-flow-centric graph construction, our model outperforms the prior state of the art on the Open Graph Benchmark Code2 relatively by 14.7%. |
| Researcher Affiliation | Collaboration | (1) Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich; (2) Google DeepMind. |
| Pseudocode | Yes | Algorithm D.1 Normalize Eigenvectors; Algorithm F.1 Magnetic Laplacian Positional Encodings; Algorithm K.1 Generate Sorting Network. (An illustrative sketch of the Magnetic Laplacian encoding follows the table.) |
| Open Source Code | Yes | Code and configuration: www.cs.cit.tum.de/daml/digraph-transformer |
| Open Datasets | Yes | We set a new state of the art on the OGB Code2 dataset (2.85% higher F1 score, 14.7% relatively) for function name prediction (§ 7). |
| Dataset Splits | Yes | For the regression tasks, we sample graphs with 16 to 63, 64 to 71, and 72 to 83 nodes for train, validation, and test, respectively. ... We construct a dataset consisting of 800,000 training instances for equally probable sequence lengths 7 ≤ p_train ≤ 11, generate the validation data with p_val = 12, and assess performance on sequence lengths 13 ≤ p_test ≤ 16. |
| Hardware Specification | Yes | For the playground classification tasks (§ 5), we train on one Nvidia GeForce GTX 1080 Ti with 11 GB RAM. Regression as well as sorting network results are obtained with a V100 with 40 GB RAM. For training the models on the function name prediction dataset, we used four Google Cloud TPUv4 (behaves like 8 distributed devices). |
| Software Dependencies | No | The paper mentions using JAX for experiments and various optimization techniques such as AdamW, adaptive gradient clipping, and cosine annealing, but it does not specify version numbers for JAX or any other software libraries or dependencies. |
| Experiment Setup | Yes | We choose the hyperparameters for each model based on a random search over the important learning parameters like learning rate, weight decay, and the parameters of AdamW... We list the important hyperparameters in Table G.1. (An illustrative optimizer sketch is given below the table.) |
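For context on the Pseudocode row: the paper's Algorithm F.1 derives positional encodings from the eigenvectors of the magnetic Laplacian of the directed graph. The snippet below is a minimal sketch in JAX (the framework the paper reports using), not a reproduction of the authors' code; the function name, default values of `q` and `k`, and the dense-matrix formulation are assumptions, and the eigenvector normalization of Algorithm D.1 is omitted.

```python
import jax.numpy as jnp


def magnetic_laplacian_pe(adj: jnp.ndarray, q: float = 0.25, k: int = 8) -> jnp.ndarray:
    """Sketch: positional encodings from the magnetic Laplacian of a directed graph.

    adj: dense (n, n) binary adjacency matrix, adj[u, v] = 1 for edge u -> v.
    q:   potential; q = 0 recovers the ordinary symmetric Laplacian.
    k:   number of low-frequency eigenvectors to keep.
    """
    adj = adj.astype(jnp.float32)
    a_sym = 0.5 * (adj + adj.T)                # symmetrized adjacency
    deg = jnp.diag(a_sym.sum(axis=1))          # symmetrized degree matrix
    theta = 2.0 * jnp.pi * q * (adj - adj.T)   # antisymmetric phase matrix encoding direction
    lap = deg - a_sym * jnp.exp(1j * theta)    # Hermitian, positive semi-definite magnetic Laplacian
    _, eigvecs = jnp.linalg.eigh(lap)          # eigenvalues returned in ascending order
    pe = eigvecs[:, :k]                        # k lowest-frequency (complex) eigenvectors
    # In practice a sign/rotation normalization (Algorithm D.1 in the paper) is applied
    # before using the real and imaginary parts as node features.
    return jnp.concatenate([pe.real, pe.imag], axis=-1)   # shape (n, 2k)
```

Setting q = 0 makes the phase matrix vanish and recovers the standard symmetric Laplacian, which is a convenient sanity check for an implementation along these lines.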
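Regarding the Software Dependencies and Experiment Setup rows: the paper names AdamW, adaptive gradient clipping, and cosine annealing but no library versions. One plausible way to assemble that optimizer stack in the JAX ecosystem is with Optax; the hyperparameter values below are placeholders, not the tuned values from Table G.1.

```python
import optax

# Placeholder schedule; the paper tunes learning rate and weight decay via random search.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,        # start of linear warmup
    peak_value=3e-4,       # peak learning rate (assumed)
    warmup_steps=1_000,
    decay_steps=100_000,   # cosine annealing horizon (assumed)
    end_value=1e-6,
)

optimizer = optax.chain(
    optax.adaptive_grad_clip(0.1),                            # adaptive gradient clipping
    optax.adamw(learning_rate=schedule, weight_decay=1e-2),   # AdamW with decoupled weight decay
)
# optimizer.init(params) and optimizer.update(grads, state, params)
# then plug into a standard JAX training loop.
```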