Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking
Authors: Michael Sejr Schlichtkrull, Nicola De Cao, Ivan Titov
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that such a classifier can be trained in a fully differentiable fashion, employing stochastic gates and encouraging sparsity through the expected L0 norm. We use our technique as an attribution method to analyse GNN models for two tasks, question answering and semantic role labelling, providing insights into the information flow in these models. We demonstrate using artificial data the shortcomings of the closest existing method, and show how our method addresses those shortcomings and improves faithfulness. We use GRAPHMASK to analyse GNN models for two NLP tasks: semantic role labeling (Marcheggiani & Titov, 2017) and multi-hop question answering (De Cao et al., 2019). (A sketch of such a stochastic gate appears below the table.) |
| Researcher Affiliation | Academia | 1 University of Amsterdam, 2 University of Edinburgh. m.s.schlichtkrull@uva.nl, n.decao@uva.nl, ititov@inf.ed.ac.uk |
| Pseudocode | No | Not found. The paper provides mathematical equations describing the GNN and GRAPHMASK formulations (e.g., Equations 1-13) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code available at https://github.com/MichSchli/GraphMask. |
| Open Datasets | Yes | SRL: We used the English CoNLL-2009 shared task dataset (Hajič et al., 2009). This dataset contains 179,014 training predicates, 6390 validation predicates, and 10498 test predicates. The dataset can be accessed at https://ufal.mff.cuni.cz/conll2009-st/. QA: For question answering, we used the WikiHop dataset (Welbl et al., 2018), and the preprocessing script from De Cao et al. (2019). See Table 4 for details. The dataset can be accessed at https://qangaroo.cs.ucl.ac.uk/. |
| Dataset Splits | Yes | SRL: We used the English CoNLL-2009 shared task dataset (Hajič et al., 2009). This dataset contains 179,014 training predicates, 6390 validation predicates, and 10498 test predicates. |
| Hardware Specification | Yes | We carried out all experiments on a single Titan X-GPU. |
| Software Dependencies | No | Not found. The paper reports optimizers and learning rates (Adam (Kingma & Ba, 2015) with initial learning rate 1e-4 for GRAPHMASK, and RMSProp (Tieleman & Hinton, 2012) with learning rate 1e-2 for λ) but does not list software libraries or version requirements. |
| Experiment Setup | Yes | When training GRAPHMASK, we found it helpful to employ a regime wherein gates are progressively added to layers, starting from the top. For a model with K layers, we begin by adding gates only for layer k, and train the parameters for these gates for δ iterations. We then add gates for the next layer k-1, train all sets of gates for another δ iterations, and continue downwards in this manner. Optimising for sparsity under the performance constraint using the development set, we found the method to perform best with δ = 1 for SRL, while the optimal setting for QA was δ = 3. We found it necessary to use separate optimizers and learning rates for the Lagrangian λ parameter and for the parameters of GRAPHMASK. Thus, we employ Adam (Kingma & Ba, 2015) with initial learning rate 1e-4 for GRAPHMASK, and RMSProp (Tieleman & Hinton, 2012) with learning rate 1e-2 for λ. For the tolerance parameter β, we found β = 0.03 to perform well for all tasks. (A schematic of this training loop appears below the table.) |
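
The "stochastic gates and expected L0 norm" quoted in the Research Type row refer to the hard-concrete style gating common in this line of work (Louizos et al., 2018). The sketch below illustrates that construction under that assumption: a single free logit per edge stands in for GRAPHMASK's erasure classifier (which in the paper predicts gates from the message-passing hidden states), so this is not the authors' implementation; their code is at the repository linked above.

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Stochastic gate with a hard-concrete relaxation and an expected-L0 penalty.

    Illustrative only: one free logit per edge replaces the erasure classifier
    that GRAPHMASK trains to predict gates from hidden states.
    """

    def __init__(self, num_edges, temperature=0.33, gamma=-0.2, zeta=1.2):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_edges))
        self.temperature = temperature
        self.gamma, self.zeta = gamma, zeta  # stretch interval, allows exact 0/1

    def forward(self):
        if self.training:
            # Reparameterised sample from the concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid(
                (u.log() - (1 - u).log() + self.log_alpha) / self.temperature
            )
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch and rectify so a gate can reach exactly 0 (edge dropped).
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # Probability that each gate is non-zero, summed: the sparsity penalty.
        shift = self.temperature * math.log(-self.gamma / self.zeta)
        return torch.sigmoid(self.log_alpha - shift).sum()

# Usage sketch: dropped messages are replaced by a learned baseline vector,
# in the spirit of the paper's gated message msg' = z * msg + (1 - z) * b.
# z = HardConcreteGate(num_edges=messages.size(0))().unsqueeze(-1)
# gated_messages = z * messages + (1 - z) * baseline
```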
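The Experiment Setup row describes constrained optimisation with a Lagrangian multiplier λ, a tolerance β = 0.03, Adam (lr 1e-4) for the gate parameters, and RMSProp (lr 1e-2) for λ. The loop below is a schematic rendering of that description; `model`, `graphmask`, `loader`, and `task_divergence` are hypothetical placeholders, and the progressive, top-down enabling of gates every δ iterations is noted only in a comment.

```python
import torch

# Hypothetical placeholders: `model` runs the frozen GNN (optionally with gates),
# `graphmask` holds gate/baseline parameters, `loader` yields batches, and
# `task_divergence` measures how far masked predictions drift from the originals.

beta = 0.03                                   # tolerance on the divergence constraint
lagrangian_lambda = torch.zeros(1, requires_grad=True)

opt_gates = torch.optim.Adam(graphmask.parameters(), lr=1e-4)
opt_lambda = torch.optim.RMSprop([lagrangian_lambda], lr=1e-2)

# The paper additionally enables gates layer by layer from the top, training for
# another delta iterations after each layer is added (delta = 1 for SRL, 3 for QA).
for batch in loader:
    with torch.no_grad():
        original_out = model(batch)                       # unmasked predictions
    masked_out, expected_l0 = model(batch, gates=graphmask)

    constraint = task_divergence(masked_out, original_out) - beta
    loss = expected_l0 + lagrangian_lambda * constraint   # Lagrangian relaxation

    opt_gates.zero_grad()
    opt_lambda.zero_grad()
    loss.backward()
    opt_gates.step()                        # descent on the gate parameters
    lagrangian_lambda.grad *= -1.0          # ascent on lambda: flip its gradient
    opt_lambda.step()
    lagrangian_lambda.data.clamp_(min=0.0)  # the multiplier stays non-negative
```

The two optimisers mirror the quoted setup: the gate parameters minimise the expected L0 penalty subject to the divergence constraint, while λ is updated in the opposite direction so the constraint is enforced rather than ignored.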