PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales
Authors: Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen, Xiang Ren
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, we find that PINTO's rationales are more faithful to its task predictions than those generated by competitive baselines. |
| Researcher Affiliation | Academia | Peifeng Wang (1,2), Aaron Chan (1), Filip Ilievski (1,2), Muhao Chen (1,2), Xiang Ren (1,2). 1: Department of Computer Science, University of Southern California; 2: Information Sciences Institute, University of Southern California. {peifengw, chanaaro, muhaoche, xiangren}@usc.edu, ilievski@isi.edu |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (Figure 2, 3), but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data used in our experiments can be found at https://github.com/wangpf3/pinto-faithful-language-reasoning. |
| Open Datasets | Yes | Datasets We experiment with several CSR benchmarks. (1) CommonsenseQA (Talmor et al., 2018) is a 5-choice QA dataset testing general commonsense reasoning about the concepts from ConceptNet (Speer et al., 2017). (2) StrategyQA (Geva et al., 2021) is a binary (yes/no) QA dataset that requires models to infer the reasoning strategy. (3) OpenBookQA (Mihaylov et al., 2018) is a 4-choice QA dataset that requests reasoning based on open book as well as broad commonsense knowledge. (4) QASC (Khot et al., 2020) is an 8-choice QA dataset that requires a system to answer a question with a valid composition of basic facts using common sense. |
| Dataset Splits | Yes | Since the gold labels for the testing sets of these datasets are not publicly available, we treat the official development set as our test set, and separate the training data into our own training set and development set. (A split sketch follows the table.) |
| Hardware Specification | No | The paper mentions using specific LMs like 'GPT-neox (20B)' and 'T5-base (220 million parameters)' and 'RoBERTa-Large' but does not provide details about the specific hardware (e.g., GPU models, CPU, RAM) used for training or inference. |
| Software Dependencies | No | The paper refers to specific language models (GPT-neox, T5-base, RoBERTa-Large) but does not list specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow versions, CUDA). |
| Experiment Setup | Yes | For the rationalizing module, we use GPT-neox (Black et al., 2022), a pretrained, autoregressive LM with 20B parameters. We manually annotate 7 examples to set up the prompt for each task dataset. For the reasoning module, we adopt T5-base (Raffel et al., 2020a) with only 220 million parameters, which is around two orders of magnitude smaller than the rationalizing module. During fine-tuning, the standard training loss (Eq. 1) and our counterfactual training loss (Eq. 2) are directly combined as the overall training loss. For perturbing rationales, we randomly choose the token masking or token replacement strategy with an equal chance in each training batch. The replacing rate for token replacement is empirically set to 30%. We run all the experiments on the compared methods 4 times using a fixed set of random seeds and report the average results. (A perturbation sketch follows the table.) |
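
To make the Dataset Splits row concrete, here is a minimal sketch of the split protocol using CommonsenseQA as an example. The Hugging Face dataset id (`commonsense_qa`), the 90/10 train/dev ratio, and the seed are assumptions; the paper only states that the official development set is used as the test set and that the training data is re-split into training and development sets.

```python
from datasets import load_dataset

# Sketch of the split protocol: official dev set -> held-out test set,
# original training data re-split into new train/dev sets.
# Dataset id, 90/10 ratio, and seed are assumptions, not from the paper.
raw = load_dataset("commonsense_qa")

test_set = raw["validation"]  # official dev set, repurposed as the test set
resplit = raw["train"].train_test_split(test_size=0.1, seed=0)
train_set, dev_set = resplit["train"], resplit["test"]
```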
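The Experiment Setup row describes how rationales are perturbed for the counterfactual training loss (Eq. 2). The sketch below follows that description: the 50/50 choice between masking and replacement and the 30% replacement rate come from the paper, while the masking rate, the function name `perturb_rationale`, and its arguments are assumptions and are not taken from the PINTO codebase.

```python
import random

REPLACE_RATE = 0.30   # replacement rate reported in the paper
MASK_RATE = 0.30      # assumption: masking rate is not given in the quoted setup

def perturb_rationale(token_ids, mask_token_id, vocab_size):
    """Perturb a rationale's token ids for the counterfactual loss (Eq. 2)."""
    perturbed = list(token_ids)
    # Choose masking or replacement with equal probability (per training batch).
    if random.random() < 0.5:
        # Token masking: hide a fraction of rationale tokens.
        for i in range(len(perturbed)):
            if random.random() < MASK_RATE:
                perturbed[i] = mask_token_id
    else:
        # Token replacement: swap ~30% of tokens with random vocabulary ids.
        for i in range(len(perturbed)):
            if random.random() < REPLACE_RATE:
                perturbed[i] = random.randrange(vocab_size)
    return perturbed
```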