Knowledge-Grounded Self-Rationalization via Extractive and Natural Language Explanations
Authors: Bodhisattwa Prasad Majumder, Oana-Maria Camburu, Thomas Lukasiewicz, Julian McAuley
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments spanning natural language (NL) and vision-language (VL) domains, we find that REXC significantly improves the quality of both ERs and NLEs, while bridging the gap between task performance and explainability. We also show, via perturbation analysis, that the explanations from REXC exhibit necessary conditions of faithfulness. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, UC San Diego, USA. 2Department of Computer Science, University of Oxford, UK. 3Institute of Logic and Computation, TU Wien, Austria. |
| Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 2) but does not provide any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Code is available at https://github.com/majumderb/rexc |
| Open Datasets | Yes | We experiment with three tasks of natural language and two tasks of vision-language understanding as described in Table 1. More task details are in Appendix B. Appendix B mentions datasets like ComVE (Wang et al., 2019), e-SNLI (Camburu et al., 2018), CoS-E (Rajani et al., 2019), e-SNLI-VE (Kayser et al., 2021), and VCR (Zellers et al., 2019), along with their licenses or statements of free availability. |
| Dataset Splits | Yes | ComVE consists of 10000/1000/1000 samples in the train/validation/test splits. e-SNLI consists of 550K/10K/10K samples in the train/validation/test splits. CoS-E consists of 9741/1221 samples in the train/validation splits. e-SNLI-VE consists of 401K/14K/14K samples in train/validation/test splits. VCR consists of 212K/26K/26K samples in train/validation/test splits. |
| Hardware Specification | Yes | For NL tasks, each model is trained with batch size of 4 on two 2080 Ti GPUs. For VL tasks, each model is trained with batch size of 32 on two 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions using "BART, UNITER, and GPT-2", tokenizers such as the "BART tokenizer" and the "BERT tokenization scheme", and the AdamW optimizer, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We trained each model for a maximum of 5 epochs, and training was stopped using an early stopping criterion based on perplexity on the validation sets. For NL tasks, each model is trained with batch size of 4 on two 2080 Ti GPUs. For the rationale extraction step, we set both λ^r_0 and λ^r_1 to 1.0. For the knowledge selection step, we set λ^g_0 to 0.9, based on validation performance. The α for mixing rationale extraction and NLE generation loss is set to 0.4. We use the AdamW optimizer (Loshchilov & Hutter, 2017) for training each model, and the learning rate was set to 6.25e-5, with a linear decay of step size 10^-1 per epoch. |
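
The split sizes quoted in the Dataset Splits row can be spot-checked directly. The sketch below checks e-SNLI only, assuming the Hugging Face Hub dataset id `esnli` mirrors the original Camburu et al. (2018) release; the other datasets would need their own loaders.

```python
# Minimal split-size check for e-SNLI (reported above as roughly 550K/10K/10K).
# Assumes the Hugging Face Hub dataset id "esnli" mirrors the original release.
from datasets import load_dataset

esnli = load_dataset("esnli")
for split in ("train", "validation", "test"):
    print(f"{split}: {len(esnli[split])} examples")
# Expected: train on the order of 550K, validation and test on the order of 10K each.
```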
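
For the Experiment Setup row, the reported hyperparameters correspond to a fairly standard fine-tuning loop. The following is a minimal sketch, not the authors' code: `model`, `train_loader`, and `val_loader` are assumed placeholders for a HuggingFace-style model and PyTorch dataloaders, and interpreting "linear decay of step size 10^-1 per epoch" as multiplying the learning rate by 0.1 after each epoch is an assumption.

```python
# Sketch of the reported training setup: AdamW at lr 6.25e-5, at most 5 epochs,
# early stopping on validation perplexity. Batch size (4 for NL tasks) is set
# when building `train_loader`, which is elided here.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, val_loader, max_epochs=5, device="cuda"):
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=6.25e-5)      # reported learning rate
    scheduler = StepLR(optimizer, step_size=1, gamma=0.1)   # assumed reading of "decay of 10^-1 per epoch"
    best_ppl = float("inf")

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss   # assumes a HuggingFace-style output with a .loss field
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        scheduler.step()

        # Early stopping criterion: perplexity = exp(mean validation loss).
        model.eval()
        with torch.no_grad():
            val_losses = [
                model(**{k: v.to(device) for k, v in b.items()}).loss.item()
                for b in val_loader
            ]
        ppl = math.exp(sum(val_losses) / len(val_losses))
        if ppl >= best_ppl:
            break                        # stop once validation perplexity stops improving
        best_ppl = ppl
    return model
```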