Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
GreaseLM: Graph REASoning Enhanced Language Models
Authors: Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, Jure Leskovec
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results on three benchmarks in the commonsense reasoning (i.e., CommonsenseQA, OpenbookQA) and medical question answering (i.e., MedQA-USMLE) domains demonstrate that GREASELM can more reliably answer questions that require reasoning over both situational constraints and structured knowledge, even outperforming models 8× larger. |
| Researcher Affiliation | Academia | Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, Jure Leskovec Stanford University EMAIL |
| Pseudocode | No | The paper describes its architecture and various operations using mathematical equations (e.g., Eqs. 1-11), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | All code, data and pretrained models are available at https://github.com/snap-stanford/GreaseLM. |
| Open Datasets | Yes | We evaluate GREASELM on three diverse multiple-choice question answering datasets across two domains: CommonsenseQA (Talmor et al., 2019) and OpenBookQA (Mihaylov et al., 2018) as commonsense reasoning benchmarks, and MedQA-USMLE (Jin et al., 2021) as a clinical QA task. |
| Dataset Splits | Yes | We perform our experiments using the in-house data split of Lin et al. (2019) to compare to baseline methods. |
| Hardware Specification | No | The paper describes its model architecture and training process, but it does not specify any hardware details such as GPU or CPU models used for experiments. |
| Software Dependencies | No | The paper mentions using specific language models like RoBERTa-Large, AristoRoBERTa, SapBERT, PubmedBERT, and BioBERT, but it does not provide specific version numbers for underlying software dependencies (e.g., Python, PyTorch/TensorFlow, CUDA). |
| Experiment Setup | Yes | Table 7: Hyperparameter settings for models and experiments |