Flexible Instance-Specific Rationalization of NLP Models

Authors: George Chrysostomou, Nikolaos Aletras (pp. 10545-10553)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Evaluation on four standard text classification datasets shows that our proposed method provides more faithful, comprehensive and highly sufficient explanations compared to using a fixed feature scoring method, rationale length and type." (Section 4, Experimental Setup: "For our experiments we use the following datasets (details in Table 1):")
Researcher Affiliation | Academia | "George Chrysostomou, Nikolaos Aletras, Department of Computer Science, University of Sheffield, gchrysostomou1@sheffield.ac.uk, n.aletras@sheffield.ac.uk"
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | Yes | "Code for experiments available at: https://github.com/GChrysostomou/instance-specific-rationale"
Open Datasets | Yes | "For our experiments we use the following datasets (details in Table 1): SST: binary sentiment classification without neutral sentences (Socher et al. 2013). AG: news articles categorized into Science, Sports, Business, and World topics (Corso, Gulli, and Romani 2005). Evidence Inference (Ev.Inf.): abstract-only biomedical articles describing randomized controlled trials ... (Lehman et al. 2019). MultiRC (M.RC): a reading comprehension task ... (Khashabi et al. 2018)."
Dataset Splits | Yes |
  Data     |W|  C  Train / Dev / Test        F1          N
  SST      18   2  6,920 / 872 / 1,821       90.1 ± 0.2  20%
  AG       36   4  102,000 / 18,000 / 7,600  93.5 ± 0.2  20%
  Ev.Inf.  363  3  5,789 / 684 / 720         83.0 ± 1.6  10%
  M.RC     305  2  24,029 / 3,214 / 4,848    73.2 ± 1.7  20%
Hardware Specification | No | No specific details about the hardware (e.g., GPU model, CPU type, memory) used for running experiments were provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) were explicitly mentioned.
Experiment Setup | Yes | "For our work, we use a 2% skip rate, which led to a seven-fold reduction in the time required to compute rationales for datasets comprising long sequences, such as M.RC and Ev.Inf., with comparable faithfulness to the slower process of removing one token at a time. We set N as the upper bound rationale length for our approach to make results comparable with fixed-length rationales."
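As a rough illustration of the skip-rate idea quoted above: instead of testing every possible rationale length one token at a time, candidate lengths are enumerated in steps of 2% of the sequence length, up to the upper bound N. The sketch below is a minimal, hypothetical reading of that search schedule; `candidate_lengths` and its parameters are assumptions for illustration, not code from the authors' repository.

```python
import math


def candidate_lengths(seq_len: int, skip_rate: float = 0.02,
                      max_ratio: float = 0.2) -> list[int]:
    """Enumerate candidate rationale lengths with a skip rate.

    Rather than evaluating every length 1, 2, ..., ceil(max_ratio * seq_len)
    (one token at a time), step through lengths in increments of
    ceil(skip_rate * seq_len). For long sequences this cuts the number of
    candidate lengths (and hence forward passes) by roughly 1/skip_rate
    of the searched range.
    """
    step = max(1, math.ceil(skip_rate * seq_len))       # 2% of the sequence
    upper = max(1, math.ceil(max_ratio * seq_len))      # upper bound N
    return list(range(step, upper + 1, step))


# For a long Ev.Inf.-scale sequence of ~300 tokens with N = 20%,
# only 10 candidate lengths are scored instead of 60.
print(candidate_lengths(300))   # steps of 6 tokens up to 60
print(candidate_lengths(18))    # short SST sequence: step collapses to 1 token
```

For short sequences (e.g., SST with ~18 tokens) the step rounds down to a single token, so the schedule degrades gracefully to the exhaustive one-token-at-a-time search.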