Tailoring Self-Rationalizers with Multi-Reward Distillation

Authors: Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on five difficult question-answering datasets StrategyQA, QuaRel, OpenBookQA, NumerSense and QASC show that not only does MARIO improve task accuracy, but it also improves the self-rationalization quality of small LMs across the aforementioned axes better than a supervised fine-tuning (SFT) baseline.
Researcher Affiliation | Academia | University of Southern California; University of Washington; University of California, Los Angeles; Allen Institute for Artificial Intelligence
Pseudocode | No | The paper describes the algorithms (QUARK and MARIO) in detail with mathematical equations in Appendix B, but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All our codes and datasets are publicly released at https://github.com/INK-USC/Rationale_Multi_Reward_Distillation.
Open Datasets | Yes | We conduct experiments on 5 QA datasets: STRATEGYQA, QUAREL, OPENBOOKQA, NUMERSENSE and QASC; the task is to generate a rationale followed by the predicted answer. We report details of train, val and test splits in Appendix D. For OPENBOOKQA and QUAREL, we use the provided training dataset, tune on the validation set and report final performances on the test set. For NUMERSENSE, we use the train, validation and test sets as in the official GitHub release. For QASC, we split the original train set into train and validation (900 questions chosen randomly for validation), and use the original validation set as the test set.
Dataset Splits | Yes | For STRATEGYQA, since labels are not available for evaluation sets, we split the train set into training, validation and test sets (taken from Joshi et al. (2023)), and report scores on this test set. For OPENBOOKQA and QUAREL, we use the provided training dataset, tune on the validation set and report final performances on the test set. For NUMERSENSE, we use the train, validation and test sets as in the official GitHub release. For QASC, we split the original train set into train and validation (900 questions chosen randomly for validation), and use the original validation set as the test set. (See the split sketch after this table.)
Hardware Specification | Yes | We run all our experiments on NVIDIA Quadro RTX 8000 GPUs.
Software Dependencies | No | The paper mentions using 'T5-LARGE (0.7B)' and 'T5-BASE' models (with links to their HuggingFace pages), but does not specify versions of ancillary software such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Tables 5, 6 and 7 show the hyperparameters used to train SFT, CONSISTENCY and MARIO respectively. Note that for our MARIO training, we use SFT as the reference model (P_ref(t | x) from Appendix B) for the KL divergence penalty. We also use the silver rationales sampled from GPT-3 as our initial data pool D (from Appendix B). Further, during inference, we always use greedy decoding. (See the KL-penalty sketch after this table.)
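
The Dataset Splits row describes QASC being re-split: 900 randomly chosen training questions become the validation set, and the original validation set is reused as the test set. Below is a minimal sketch of such a split, assuming the HuggingFace hub ID allenai/qasc and an illustrative random seed; the released code may implement this differently.

```python
import random
from datasets import load_dataset

# Load QASC; per the paper, the original validation split is reused as the test set.
qasc = load_dataset("allenai/qasc")
test_set = qasc["validation"]

# Hold out 900 randomly chosen training questions for validation.
# The seed is illustrative; the paper does not report one here.
train_examples = list(qasc["train"])
random.seed(0)
random.shuffle(train_examples)
val_set = train_examples[:900]
train_set = train_examples[900:]

print(len(train_set), len(val_set), len(test_set))
```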
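
The Experiment Setup row notes that MARIO's training uses the SFT model as the reference P_ref(t | x) for a KL divergence penalty. Below is a minimal sketch of a per-token KL penalty between policy and reference distributions, assuming both models produce logits over the same vocabulary; the function name, the use of the exact (rather than sampled-token) KL, and the coefficient value are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def per_token_kl_penalty(policy_logits: torch.Tensor,
                         ref_logits: torch.Tensor) -> torch.Tensor:
    """KL(policy || reference) at every generated position.

    Both inputs have shape (batch, seq_len, vocab_size): the reference
    logits come from the frozen SFT model, the policy logits from the
    model being trained. Returns a (batch, seq_len) tensor that can be
    scaled by a coefficient and subtracted from the per-token reward.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(P || Q) = sum_v P(v) * (log P(v) - log Q(v))
    return (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)

# Illustrative usage with random logits standing in for model outputs.
policy_logits = torch.randn(2, 5, 32128)  # 32128 = T5 vocabulary size
ref_logits = torch.randn(2, 5, 32128)
beta = 0.05                               # hypothetical penalty coefficient
penalty = beta * per_token_kl_penalty(policy_logits, ref_logits)
print(penalty.shape)  # torch.Size([2, 5])
```

At inference time no penalty is involved; as the same row notes, generation always uses greedy decoding.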