Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Diagnostics-Guided Explanation Generation
Authors: Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, Isabelle Augenstein (pp. 10445-10453)
AAAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments on three datasets from the ERASER benchmark (DeYoung et al. 2020a) (FEVER, MultiRC, Movies)... |
| Researcher Affiliation | Academia | Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, Isabelle Augenstein Department of Computer Science, University of Copenhagen, Denmark EMAIL |
| Pseudocode | No | The paper describes its methods in prose, detailing steps and components, but it does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | We make an extended version of the manuscript and code available on https://github.com/copenlu/diagnostic-guidedexplanations . |
| Open Datasets | Yes | We perform experiments on three datasets from the ERASER benchmark (DeYoung et al. 2020a) (FEVER, MultiRC, Movies), all of which require complex reasoning and have sentence-level rationales. |
| Dataset Splits | No | The paper uses standard benchmark datasets but does not explicitly provide specific percentages, sample counts, or citations for how training, validation, and test splits were performed. |
| Hardware Specification | No | The paper mentions using 'BERT (Devlin et al. 2019) base-uncased as our base architecture' but does not specify any hardware details (e.g., GPU/CPU models, memory, cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions key software components like 'Transformer' and 'BERT base-uncased', but it does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | No | The paper describes the model and training objectives, noting the use of hyperparameters like λ (for sparsity penalty) and K (for word masking), but it does not provide specific numerical values for these or other typical experimental setup details such as learning rate, batch size, or number of epochs. |