UNIREX: A Unified Learning Framework for Language Model Rationale Extraction
Authors: Aaron Chan, Maziar Sanjabi, Lambert Mathias, Liang Tan, Shaoliang Nie, Xiaochang Peng, Xiang Ren, Hamed Firooz
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On five English text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Plus, UNIREX rationale extractors' faithfulness can even generalize to unseen datasets and tasks. |
| Researcher Affiliation | Collaboration | 1University of Southern California 2Meta AI. Correspondence to: Aaron Chan <chanaaro@usc.edu>. |
| Pseudocode | No | The provided text does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing open-source code or a link to a code repository. |
| Open Datasets | Yes | We primarily experiment with the SST (Socher et al., 2013; Carton et al., 2020), Movies (Zaidan & Eisner, 2008), CoS-E (Rajani et al., 2019), MultiRC (Khashabi et al., 2018), and e-SNLI (Camburu et al., 2018) datasets, all of which have gold rationale annotations. The latter four datasets were taken from the ERASER benchmark (DeYoung et al., 2019). |
| Dataset Splits | Yes | Let D = {(x_i, y_i)}_{i=1}^N be a dataset, where X = {x_i}_{i=1}^N are the text inputs, Y = {y_i}_{i=1}^N are the labels, and N is the number of instances (x_i, y_i) in D. We also assume D can be partitioned into train set D_train, dev set D_dev, and test set D_test. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU model, CPU model, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch Lightning' but does not provide specific version numbers for it or other ancillary software components, which are required for reproducibility. |
| Experiment Setup | Yes | For all experiments, we use a learning rate of 2e-5 and an effective batch size of 32. We train for a maximum of 10 epochs, with early stopping patience of 5 epochs. We only tune the faithfulness and plausibility loss weights, sweeping αc ∈ {0.5, 0.7, 1.0}, αs ∈ {0.5, 0.7, 1.0}, and αp ∈ {0.5, 0.7, 1.0}. We find that αc = 0.5 and αs = 0.5 are usually best. For each method variant, we tuned hyperparameters w.r.t. dev CNRG, computed across all hyperparameter configurations for the variant. For the batching factor β (Sec. A.4), we use β = 2. A hedged configuration sketch of this setup follows the table. |
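
For concreteness, here is a minimal sketch of the reported experiment setup, assuming a Python setting (the paper names only PyTorch Lightning, without versions). The hyperparameter values and the αc/αs/αp grid with dev-CNRG model selection come from the Experiment Setup row above; `train_and_eval` and the configuration keys are hypothetical placeholders, since the authors' code is not released.

```python
# Hedged sketch of the reported UNIREX experiment setup.
# `train_and_eval` is a hypothetical placeholder for the paper's (unreleased)
# PyTorch Lightning training loop; only the hyperparameter values are from the paper.
from itertools import product

BASE_CONFIG = {
    "learning_rate": 2e-5,          # reported learning rate
    "effective_batch_size": 32,     # reported effective batch size
    "max_epochs": 10,               # train for at most 10 epochs
    "early_stopping_patience": 5,   # early stopping patience, in epochs
    "batching_factor_beta": 2,      # batching factor beta (paper's Sec. A.4)
}

# Only the faithfulness (alpha_c, alpha_s) and plausibility (alpha_p) loss
# weights are tuned, each swept over {0.5, 0.7, 1.0}.
SWEEP_VALUES = [0.5, 0.7, 1.0]


def train_and_eval(config: dict) -> float:
    """Hypothetical stand-in: train one UNIREX variant, return dev CNRG."""
    raise NotImplementedError("Placeholder for the unreleased training code.")


def run_sweep():
    """Grid-search the loss weights, selecting by dev CNRG as the paper reports."""
    best_cnrg, best_config = float("-inf"), None
    for alpha_c, alpha_s, alpha_p in product(SWEEP_VALUES, repeat=3):
        config = {**BASE_CONFIG,
                  "alpha_c": alpha_c, "alpha_s": alpha_s, "alpha_p": alpha_p}
        dev_cnrg = train_and_eval(config)
        if dev_cnrg > best_cnrg:
            best_cnrg, best_config = dev_cnrg, config
    return best_config, best_cnrg
```

The paper notes that αc = 0.5 and αs = 0.5 are usually the best-performing weights under this selection criterion.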