Flexible Instance-Specific Rationalization of NLP Models

Authors: George Chrysostomou, Nikolaos Aletras (pp. 10545-10553)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Evaluation on four standard text classification datasets shows that our proposed method provides more faithful, comprehensive and highly sufficient explanations compared to using a fixed feature scoring method, rationale length and type." (Section 4, Experimental Setup: "For our experiments we use the following datasets (details in Table 1):")
Researcher Affiliation | Academia | "George Chrysostomou, Nikolaos Aletras, Department of Computer Science, University of Sheffield, gchrysostomou1@sheffield.ac.uk, n.aletras@sheffield.ac.uk"
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | Yes | "Code for experiments available at: https://github.com/GChrysostomou/instance-specific-rationale"
Open Datasets | Yes | "For our experiments we use the following datasets (details in Table 1): SST: binary sentiment classification without neutral sentences (Socher et al. 2013). AG: news articles categorized into Science, Sports, Business, and World topics (Corso, Gulli, and Romani 2005). Evidence Inference (Ev.Inf.): abstract-only biomedical articles describing randomized controlled trials ... (Lehman et al. 2019). MultiRC (M.RC): a reading comprehension task ... (Khashabi et al. 2018)."
Dataset Splits | Yes |
  Data     |W|  C  Train / Dev / Test        F1          N
  SST      18   2  6,920 / 872 / 1,821       90.1 ± 0.2  20%
  AG       36   4  102,000 / 18,000 / 7,600  93.5 ± 0.2  20%
  Ev.Inf.  363  3  5,789 / 684 / 720         83.0 ± 1.6  10%
  M.RC     305  2  24,029 / 3,214 / 4,848    73.2 ± 1.7  20%
Hardware Specification | No | No specific details about the hardware (e.g., GPU model, CPU type, memory) used for running experiments were provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) were explicitly mentioned.
Experiment Setup | Yes | "For our work, we use a 2% skip rate, which led to a seven-fold reduction in the time required to compute rationales for datasets comprising long sequences, such as M.RC and Ev.Inf., with comparable faithfulness to the slower process of removing one token at a time. We set N as the upper bound rationale length for our approach to make results comparable with fixed-length rationales."
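As a rough illustration of the skip-rate idea quoted above: instead of testing every possible rationale length one token at a time, candidate lengths are enumerated in steps of 2% of the sequence length, up to the upper bound N. The sketch below is a minimal, hypothetical reading of that search schedule; `candidate_lengths` and its parameters are assumptions for illustration, not code from the authors' repository.

```python
import math


def candidate_lengths(seq_len: int, skip_rate: float = 0.02,
                      max_ratio: float = 0.2) -> list[int]:
    """Enumerate candidate rationale lengths with a skip rate.

    Rather than evaluating every length 1, 2, ..., ceil(max_ratio * seq_len)
    (one token at a time), step through lengths in increments of
    ceil(skip_rate * seq_len). For long sequences this cuts the number of
    candidate lengths (and hence forward passes) by roughly 1/skip_rate
    of the searched range.
    """
    step = max(1, math.ceil(skip_rate * seq_len))       # 2% of the sequence
    upper = max(1, math.ceil(max_ratio * seq_len))      # upper bound N
    return list(range(step, upper + 1, step))


# For a long Ev.Inf.-scale sequence of ~300 tokens with N = 20%,
# only 10 candidate lengths are scored instead of 60.
print(candidate_lengths(300))   # steps of 6 tokens up to 60
print(candidate_lengths(18))    # short SST sequence: step collapses to 1 token
```

For short sequences (e.g., SST with ~18 tokens) the step rounds down to a single token, so the schedule degrades gracefully to the exhaustive one-token-at-a-time search.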