Knowledge-Grounded Self-Rationalization via Extractive and Natural Language Explanations

Authors: Bodhisattwa Prasad Majumder, Oana Camburu, Thomas Lukasiewicz, Julian McAuley

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments spanning natural language (NL) and vision-language (VL) domains, we find that REXC significantly improves the quality of both ERs and NLEs, while bridging the gap between task performance and explainability. We also show, via perturbation analysis, that the explanations from REXC exhibit necessary conditions of faithfulness.
Researcher Affiliation | Academia | (1) Department of Computer Science and Engineering, UC San Diego, USA; (2) Department of Computer Science, University of Oxford, UK; (3) Institute of Logic and Computation, TU Wien, Austria.
Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 2) but does not provide any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Code is available at https://github.com/majumderb/rexc
Open Datasets | Yes | We experiment with three tasks of natural language and two tasks of vision-language understanding, as described in Table 1. More task details are in Appendix B. Appendix B lists datasets such as ComVE (Wang et al., 2019), e-SNLI (Camburu et al., 2018), COSe (Rajani et al., 2019), e-SNLI-VE (Kayser et al., 2021), and VCR (Zellers et al., 2019), along with their licenses or statements of free availability.
Dataset Splits | Yes | ComVE consists of 10000/1000/1000 samples in the train/validation/test splits. e-SNLI consists of 550K/10K/10K samples in the train/validation/test splits. COSe consists of 9741/1221 samples in the train/validation splits. e-SNLI-VE consists of 401K/14K/14K samples in train/validation/test splits. VCR consists of 212K/26K/26K samples in train/validation/test splits. (A minimal split-loading sketch follows the table.)
Hardware Specification | Yes | For NL tasks, each model is trained with a batch size of 4 on two 2080 Ti GPUs. For VL tasks, each model is trained with a batch size of 32 on two 2080 Ti GPUs.
Software Dependencies | No | The paper mentions using "BART, UNITER, and GPT-2", tokenizers such as the "BART tokenizer" and the "BERT tokenization scheme", and the AdamW optimizer, but does not provide specific version numbers for any of these software components. (A version-recording sketch follows the table.)
Experiment Setup | Yes | We trained each model for a maximum of 5 epochs, and training was stopped using an early stopping criterion based on perplexity on the validation sets. For NL tasks, each model is trained with a batch size of 4 on two 2080 Ti GPUs. For the rationale extraction step, we set both λ^r_0 and λ^r_1 to 1.0. For the knowledge selection step, we set λ^g_0 to 0.9, based on validation performance. The α for mixing the rationale extraction and NLE generation losses is set to 0.4. We use the AdamW optimizer (Loshchilov & Hutter, 2017) for training each model, and the learning rate was set to 6.25e-5, with a linear decay of step size 10^-1 per epoch. (A sketch of this optimization setup follows the table.)
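
The split counts quoted in the Dataset Splits row can be sanity-checked when re-running. Below is a minimal sketch, assuming the Hugging Face `datasets` library and the public `esnli` dataset id; neither is specified by the paper or this report, which only state the split sizes.

```python
# Minimal sketch: load e-SNLI and print split sizes to compare against the
# reported 550K/10K/10K train/validation/test counts.
# Assumption: the Hugging Face `datasets` library and the "esnli" dataset id;
# the paper does not state how its copies of the datasets were obtained.
from datasets import load_dataset

esnli = load_dataset("esnli")
for split_name, split in esnli.items():
    print(f"{split_name}: {len(split)} examples")
```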
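Since the Software Dependencies row notes that no version numbers are given, one practical step for a reproduction is to record the versions actually used. A small sketch follows, assuming a PyTorch plus Hugging Face `transformers` stack and a `facebook/bart-base` checkpoint; these are assumptions based on the models named above, not details the paper pins down.

```python
# Sketch: record the library versions used in a rerun, since the paper names
# BART, GPT-2, UNITER, the BART tokenizer, and AdamW but gives no versions.
# The packages and the "facebook/bart-base" checkpoint below are assumptions.
import torch
import transformers
from transformers import BartTokenizer

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
print(tokenizer.tokenize("Knowledge-grounded self-rationalization."))
```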
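The Experiment Setup row fixes several hyperparameters: AdamW, learning rate 6.25e-5, a per-epoch decay of 10^-1, a maximum of 5 epochs, early stopping on validation perplexity, and α = 0.4 for mixing the rationale-extraction and NLE-generation losses. The sketch below wires these together in PyTorch; the placeholder model and losses, the exact mixing form α·L_rationale + (1-α)·L_NLE, and reading the decay as a per-epoch multiplicative factor of 0.1 are all assumptions rather than details confirmed by the paper.

```python
# Sketch of the reported optimization setup. PyTorch is assumed; the model,
# data, and losses are placeholders standing in for REXC's rationale-extraction
# and NLE-generation objectives, which this report does not reproduce.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(768, 2)                                # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=6.25e-5)  # reported learning rate
# Interpreting "linear decay of step size 10^-1 per epoch" as multiplying the
# learning rate by 0.1 each epoch is an assumption.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

alpha = 0.4        # reported mixing weight for the two losses
max_epochs = 5     # reported maximum number of epochs
best_val_ppl = float("inf")

for epoch in range(max_epochs):
    # Placeholder batch (batch size 4, as reported for NL tasks).
    x, y = torch.randn(4, 768), torch.randint(0, 2, (4,))
    logits = model(x)
    rationale_loss = F.cross_entropy(logits, y)  # stand-in for the rationale-extraction loss
    nle_loss = F.cross_entropy(logits, y)        # stand-in for the NLE-generation loss
    loss = alpha * rationale_loss + (1 - alpha) * nle_loss  # assumed mixing form
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

    val_ppl = float(loss.exp())  # placeholder for perplexity on the validation set
    if val_ppl < best_val_ppl:
        best_val_ppl = val_ppl
    else:
        break  # early stopping on validation perplexity, as reported
```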