Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
On Guaranteed Optimal Robust Explanations for NLP Models
Authors: Emanuele La Malfa, Rhiannon Michelmore, Agnieszka M. Zbrzezny, Nicola Paoletti, Marta Kwiatkowska
IJCAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our framework on three widely used sentiment analysis tasks and texts of up to 100 words from SST, Twitter and IMDB datasets, demonstrating the effectiveness of the derived explanations. |
| Researcher Affiliation | Academia | 1 University of Oxford; 2 University of Warmia and Mazury, Olsztyn; 3 Royal Holloway, University of London |
| Pseudocode | Yes | A more detailed discussion (including the pseudo-code) is available in the supplement. |
| Open Source Code | Yes | Code available at https://github.com/EmanueleLM/OREs |
| Open Datasets | Yes | We considered 3 well-established benchmarks for sentiment analysis, namely SST [Socher et al., 2013], Twitter [Go et al., 2009] and IMDB [Maas et al., 2011] datasets. |
| Dataset Splits | No | From these, we have chosen 40 representative input texts, balancing positive and negative examples. |
| Hardware Specification | Yes | Experiments were parallelized on a server with two 24-core Intel Xeon 6252 processors and 256GB of RAM, but each instance is single-threaded and can be executed on a low-end laptop. |
| Software Dependencies | No | Both the HS and MSA algorithms have been implemented in Python and use Marabou [Katz et al., 2019] and Neurify [Wang et al., 2018] to answer robustness queries. |
| Experiment Setup | Yes | In the experiments below, we opted for the kNN box perturbation space, as we found that the k parameter was easier to interpret and tune than the ϵ parameter for the ϵ-ball space, and improved verification time. (e.g., Figures 2, 3, 4, 7 specify k=15, k=25, k=8, k=10 for kNN boxes) |