UNIREX: A Unified Learning Framework for Language Model Rationale Extraction

Authors: Aaron Chan, Maziar Sanjabi, Lambert Mathias, Liang Tan, Shaoliang Nie, Xiaochang Peng, Xiang Ren, Hamed Firooz

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On five English text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Plus, UNIREX rationale extractors' faithfulness can even generalize to unseen datasets and tasks.
Researcher Affiliation | Collaboration | 1 University of Southern California, 2 Meta AI. Correspondence to: Aaron Chan <chanaaro@usc.edu>.
Pseudocode | No | The provided text does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing open-source code or a link to a code repository.
Open Datasets | Yes | We primarily experiment with the SST (Socher et al., 2013; Carton et al., 2020), Movies (Zaidan & Eisner, 2008), CoS-E (Rajani et al., 2019), MultiRC (Khashabi et al., 2018), and e-SNLI (Camburu et al., 2018) datasets, all of which have gold rationale annotations. The latter four datasets were taken from the ERASER benchmark (DeYoung et al., 2019).
Dataset Splits | Yes | Let D = {X, Y}_{i=1}^N be a dataset, where X = {x_i}_{i=1}^N are the text inputs, Y = {y_i}_{i=1}^N are the labels, and N is the number of instances (x_i, y_i) in D. We also assume D can be partitioned into train set D_train, dev set D_dev, and test set D_test.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU model, CPU model, memory) used to run the experiments.
Software Dependencies | No | The paper mentions 'PyTorch-Lightning' but does not provide specific version numbers for it or for other ancillary software components, which are needed for reproducibility.
Experiment Setup | Yes | For all experiments, we use a learning rate of 2e-5 and effective batch size of 32. We train for a maximum of 10 epochs, with early stopping patience of 5 epochs. We only tune faithfulness and plausibility loss weights, sweeping αc = [0.5, 0.7, 1.0], αs = [0.5, 0.7, 1.0], and αp = [0.5, 0.7, 1.0]. We find that αc = 0.5 and αs = 0.5 are usually best. For each method variant, we tuned hyperparameters w.r.t. dev CNRG, computed across all hyperparameter configurations for the variant. For the batching factor β (Sec. A.4), we use 2.
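
The "Dataset Splits" row above only quotes the paper's notation for D and its partition into D_train, D_dev, and D_test. As a minimal sketch of that partition, the snippet below splits a list of (x_i, y_i) instances into the three subsets; the 80/10/10 proportions, the random seed, and the `partition` helper are illustrative assumptions, not the authors' code (the paper relies on each dataset's released splits).

```python
# Minimal sketch of partitioning D into D_train / D_dev / D_test.
# Proportions and seed are assumptions for illustration only.
import random

def partition(dataset, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split a list of (x_i, y_i) instances into train/dev/test subsets."""
    instances = list(dataset)
    random.Random(seed).shuffle(instances)
    n = len(instances)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    d_test = instances[:n_test]
    d_dev = instances[n_test:n_test + n_dev]
    d_train = instances[n_test + n_dev:]
    return d_train, d_dev, d_test

# Example: D as a list of (text input x_i, label y_i) pairs.
D = [(f"sentence {i}", i % 2) for i in range(100)]
D_train, D_dev, D_test = partition(D)
print(len(D_train), len(D_dev), len(D_test))  # 80 10 10
```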
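
The "Experiment Setup" row describes a small grid search: fixed optimization settings (learning rate 2e-5, effective batch size 32, at most 10 epochs, early stopping patience 5, batching factor β = 2) and a 3 × 3 × 3 sweep over the faithfulness (αc, αs) and plausibility (αp) loss weights, selected by dev CNRG. The sketch below lays that sweep out explicitly; `train_and_eval` is a hypothetical placeholder for the authors' training/evaluation pipeline, which is not released.

```python
# Sketch of the reported hyperparameter sweep, assuming a user-supplied
# `train_and_eval(config)` that trains one model and returns its dev CNRG.
import itertools

FIXED = {
    "learning_rate": 2e-5,        # reported learning rate
    "effective_batch_size": 32,   # reported effective batch size
    "max_epochs": 10,             # maximum training epochs
    "early_stopping_patience": 5, # early stopping patience (epochs)
    "batching_factor_beta": 2,    # batching factor from Sec. A.4
}

GRID = {
    "alpha_c": [0.5, 0.7, 1.0],  # faithfulness loss weight αc
    "alpha_s": [0.5, 0.7, 1.0],  # faithfulness loss weight αs
    "alpha_p": [0.5, 0.7, 1.0],  # plausibility loss weight αp
}

def sweep(train_and_eval):
    """Run all 27 configurations and keep the one with the best dev CNRG."""
    best_config, best_cnrg = None, float("-inf")
    for alpha_c, alpha_s, alpha_p in itertools.product(*GRID.values()):
        config = dict(FIXED, alpha_c=alpha_c, alpha_s=alpha_s, alpha_p=alpha_p)
        dev_cnrg = train_and_eval(config)  # hypothetical training/eval call
        if dev_cnrg > best_cnrg:
            best_config, best_cnrg = config, dev_cnrg
    return best_config, best_cnrg
```

Per the quoted setup, such a sweep would typically land on αc = 0.5 and αs = 0.5, with the best variant chosen by comparing dev CNRG across all configurations.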