UNIREX: A Unified Learning Framework for Language Model Rationale Extraction

Authors: Aaron Chan, Maziar Sanjabi, Lambert Mathias, Liang Tan, Shaoliang Nie, Xiaochang Peng, Xiang Ren, Hamed Firooz

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On five English text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Plus, UNIREX rationale extractors' faithfulness can even generalize to unseen datasets and tasks.
Researcher Affiliation | Collaboration | 1 University of Southern California, 2 Meta AI. Correspondence to: Aaron Chan <chanaaro@usc.edu>.
Pseudocode | No | The provided text does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing open-source code or a link to a code repository.
Open Datasets | Yes | We primarily experiment with the SST (Socher et al., 2013; Carton et al., 2020), Movies (Zaidan & Eisner, 2008), CoS-E (Rajani et al., 2019), MultiRC (Khashabi et al., 2018), and e-SNLI (Camburu et al., 2018) datasets, all of which have gold rationale annotations. The latter four datasets were taken from the ERASER benchmark (DeYoung et al., 2019).
Dataset Splits | Yes | Let D = {X, Y}_{i=1}^N be a dataset, where X = {x_i}_{i=1}^N are the text inputs, Y = {y_i}_{i=1}^N are the labels, and N is the number of instances (x_i, y_i) in D. We also assume D can be partitioned into train set D_train, dev set D_dev, and test set D_test.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU model, CPU model, memory) used to run the experiments.
Software Dependencies | No | The paper mentions 'PyTorch-Lightning' but does not provide specific version numbers for it or for other ancillary software components, which are needed for reproducibility.
Experiment Setup | Yes | For all experiments, we use a learning rate of 2e-5 and effective batch size of 32. We train for a maximum of 10 epochs, with early stopping patience of 5 epochs. We only tune faithfulness and plausibility loss weights, sweeping αc = [0.5, 0.7, 1.0], αs = [0.5, 0.7, 1.0], and αp = [0.5, 0.7, 1.0]. We find that αc = 0.5 and αs = 0.5 are usually best. For each method variant, we tuned hyperparameters w.r.t. dev CNRG, computed across all hyperparameter configurations for the variant. For the batching factor β (Sec. A.4), we use 2.
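
The "Dataset Splits" row above only quotes the paper's notation for D and its partition into D_train, D_dev, and D_test. As a minimal sketch of that partition, the snippet below splits a list of (x_i, y_i) instances into the three subsets; the 80/10/10 proportions, the random seed, and the `partition` helper are illustrative assumptions, not the authors' code (the paper relies on each dataset's released splits).

```python
# Minimal sketch of partitioning D into D_train / D_dev / D_test.
# Proportions and seed are assumptions for illustration only.
import random

def partition(dataset, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split a list of (x_i, y_i) instances into train/dev/test subsets."""
    instances = list(dataset)
    random.Random(seed).shuffle(instances)
    n = len(instances)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    d_test = instances[:n_test]
    d_dev = instances[n_test:n_test + n_dev]
    d_train = instances[n_test + n_dev:]
    return d_train, d_dev, d_test

# Example: D as a list of (text input x_i, label y_i) pairs.
D = [(f"sentence {i}", i % 2) for i in range(100)]
D_train, D_dev, D_test = partition(D)
print(len(D_train), len(D_dev), len(D_test))  # 80 10 10
```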
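
The "Experiment Setup" row describes a small grid search: fixed optimization settings (learning rate 2e-5, effective batch size 32, at most 10 epochs, early stopping patience 5, batching factor β = 2) and a 3 × 3 × 3 sweep over the faithfulness (αc, αs) and plausibility (αp) loss weights, selected by dev CNRG. The sketch below lays that sweep out explicitly; `train_and_eval` is a hypothetical placeholder for the authors' training/evaluation pipeline, which is not released.

```python
# Sketch of the reported hyperparameter sweep, assuming a user-supplied
# `train_and_eval(config)` that trains one model and returns its dev CNRG.
import itertools

FIXED = {
    "learning_rate": 2e-5,        # reported learning rate
    "effective_batch_size": 32,   # reported effective batch size
    "max_epochs": 10,             # maximum training epochs
    "early_stopping_patience": 5, # early stopping patience (epochs)
    "batching_factor_beta": 2,    # batching factor from Sec. A.4
}

GRID = {
    "alpha_c": [0.5, 0.7, 1.0],  # faithfulness loss weight αc
    "alpha_s": [0.5, 0.7, 1.0],  # faithfulness loss weight αs
    "alpha_p": [0.5, 0.7, 1.0],  # plausibility loss weight αp
}

def sweep(train_and_eval):
    """Run all 27 configurations and keep the one with the best dev CNRG."""
    best_config, best_cnrg = None, float("-inf")
    for alpha_c, alpha_s, alpha_p in itertools.product(*GRID.values()):
        config = dict(FIXED, alpha_c=alpha_c, alpha_s=alpha_s, alpha_p=alpha_p)
        dev_cnrg = train_and_eval(config)  # hypothetical training/eval call
        if dev_cnrg > best_cnrg:
            best_config, best_cnrg = config, dev_cnrg
    return best_config, best_cnrg
```

Per the quoted setup, such a sweep would typically land on αc = 0.5 and αs = 0.5, with the best variant chosen by comparing dev CNRG across all configurations.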