Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
GNN is a Counter? Revisiting GNN for Question Answering
Authors: Kuan Wang, Yuyu Zhang, Diyi Yang, Le Song, Tao Qin
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We discover that even a very simple graph neural counter can outperform all the existing GNN modules on Commonsense QA and Open Book QA, two popular QA benchmark datasets which heavily rely on knowledge-aware reasoning. |
| Researcher Affiliation | Collaboration | 1Georgia Institute of Technology 2Microsoft Research Asia 3Bio Map 4MBZUAI EMAIL EMAIL EMAIL |
| Pseudocode | Yes | Algorithm 1 Py Torch-style code of GSC |
| Open Source Code | No | The paper does not provide a concrete link or explicit statement about the availability of source code for the described methodology. |
| Open Datasets | Yes | We conduct extensive experiments on Commonsense QA (Talmor et al., 2019) and Open Book QA (Mihaylov et al., 2018), two popular QA benchmark datasets that heavily rely on knowledge-aware reasoning capability. |
| Dataset Splits | Yes | Hence, following the data split of Lin et al. (2019), we experiment and report the accuracy on the in-house dev (IHdev) and test (IHtest) splits. We also report the accuracy of our final system on the official test set. |
| Hardware Specification | Yes | On a single Quadro RTX6000 GPU, each GSC training only takes about 2 hours to converge, while other methods often take 10+ hours. |
| Software Dependencies | No | The paper mentions 'Py Torch-style code' and 'RAdam' optimizer but does not specify version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We use RAdam (Liu et al., 2020) as the optimizer and set the batch size to 128. The learning rate is 1e-5 for Ro BERTa and 1e-2 for GSC. The maximum number of epoch is set to 30 for Commonsense QA and 75 for Open Book QA. |