Faithfulness Measurable Masked Language Models

Authors: Andreas Madsen, Siva Reddy, Sarath Chandar

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the generality of our approach by applying it to 16 different datasets and validate it using statistical in-distribution tests. The faithfulness is then measured with 9 different importance measures.
Researcher Affiliation Collaboration 1Mila, Montreal, Canada 2Computer Engineering and Software Engineering Department, Polytechnique Montreal, Montreal, Canada 3Computer Science and Linguistics, McGill University, Montreal, Canada 4Facebook CIFAR AI Chair 5Canada CIFAR AI Chair.
Pseudocode Yes Algorithm 1 Creates the mini-batches used in masked fine-tuning. Algorithm 2 MaSF algorithm, which provides p-values under the in-distribution null-hypothesis. Algorithm 3 Measures the masked model performance given an explanation.
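The quote only names the three algorithms; as an illustration of what Algorithm 1 (mini-batch construction for masked fine-tuning) might look like, here is a minimal sketch. The 50/50 masked/unmasked split within a batch, the masking rate, and the `[MASK]` token id are illustrative assumptions, not the paper's exact procedure.

```python
import random

MASK_ID = 103  # hypothetical [MASK] token id (BERT-style convention); an assumption

def masked_finetuning_batches(sequences, batch_size, mask_rate=0.15, seed=0):
    """Sketch of masked fine-tuning mini-batch construction: yield
    mini-batches in which some sequences are randomly token-masked and
    the rest are left unmasked, so the model sees both input conditions.
    The alternating split and mask_rate are illustrative assumptions."""
    rng = random.Random(seed)
    for start in range(0, len(sequences), batch_size):
        batch = [list(seq) for seq in sequences[start:start + batch_size]]
        for i, seq in enumerate(batch):
            if i % 2 == 0:  # mask every other sequence in the batch
                for j in range(len(seq)):
                    if rng.random() < mask_rate:
                        seq[j] = MASK_ID
        yield batch
```

A usage example: `list(masked_finetuning_batches(token_id_lists, batch_size=32))` yields batches whose even-indexed sequences contain randomly masked tokens while odd-indexed sequences are untouched.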
Open Source Code Yes The code is available at https://github.com/AndreasMadsen/faithfulness-measurable-models.
Open Datasets Yes The datasets used in this work are all public and listed below. They are all used for their intended use, which is measuring classification performance. The Diabetes and Anemia datasets are from MIMIC-III, which requires a HIPAA certification for data analysis (Johnson et al., 2016). Table 2. Datasets used; all datasets are either single-sequence or sequence-pair datasets. All datasets are sourced from GLUE (Wang et al., 2019b), SuperGLUE (Wang et al., 2019a), MIMIC-III (Johnson et al., 2016), or bAbI (Weston et al., 2016).
Dataset Splits Yes To include masking support in early stopping, the validation dataset is duplicated, where one copy is unmasked and one copy is randomly masked. The validation dataset is used to develop the empirical CDFs.
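The duplication described above is simple to sketch: the early-stopping validation set becomes an unmasked copy concatenated with a randomly masked copy, so the validation loss reflects both input conditions. The masking rate and mask token id below are illustrative assumptions.

```python
import random

def duplicate_validation(val_sequences, mask_rate=0.15, mask_id=103, seed=0):
    """Sketch: build an early-stopping validation set consisting of an
    unmasked copy of every sequence plus a randomly masked copy.
    mask_rate and mask_id are illustrative assumptions."""
    rng = random.Random(seed)
    unmasked = [list(seq) for seq in val_sequences]
    masked = [[mask_id if rng.random() < mask_rate else tok for tok in seq]
              for seq in val_sequences]
    return unmasked + masked
```

The returned list is twice the size of the input, with the original sequences preserved in the first half.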
Hardware Specification Yes The compute hardware specifications are in Table 5 and were the same for all experiments. Table 5. The computing hardware used. Note that a shared user system was used; only the allocated resources are reported. CPU 12 cores, Intel Silver 4216 Cascade Lake @ 2.1GHz GPU 1x NVidia V100 (32GB HBM2 memory) Memory 24 GB
Software Dependencies No We use the Hugging Face implementation of RoBERTa and the TensorFlow framework.
Experiment Setup Yes We use RoBERTa in size base and large, with the default GLUE hyperparameters provided by Liu et al. (2019). The hyperparameters are defined by Liu et al. (2019, Appendix C, GLUE). Although these hyperparameters are for the GLUE tasks, we use them for all tasks. The one exception is that the maximum number of epochs is higher, because masked fine-tuning requires more epochs. In Table 3 we specify the max epoch parameter.