Label-Descriptive Patterns and Their Application to Characterizing Classification Errors
Authors: Michael A. Hedderich, Jonas Fischer, Dietrich Klakow, Jilles Vreeken
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through an extensive set of experiments we show it performs very well in practice on both synthetic and real-world data. |
| Researcher Affiliation | Academia | 1 Saarland University, Saarland Informatics Campus, Saarbrücken, Germany. 2 Max Planck Institute for Informatics, Saarbrücken, Germany. 3 CISPA Helmholtz Center for Information Security, Saarbrücken, Germany. |
| Pseudocode | Yes | Algorithm 1 PREMISE ... Algorithm 2 createCandidates |
| Open Source Code | Yes | datasets and code available online. https://github.com/uds-lsv/premise |
| Open Datasets | Yes | We analyze the misclassification of Visual7W (Zhu et al., 2016) and the state-of-the-art LXMERT (Tan & Bansal, 2019)... The classifier is trained on the standard NER dataset CoNLL03... On OntoNotes, a dataset covering a wider range of topics... we derive transactions/instances from the around 3.4k sentences in the development set of the Penn Treebank Corpus. |
| Dataset Splits | Yes | We derive misclassification data sets from applying the classifiers to the development sets. ... For LXMERT, the minival version of the development set is used. ... On OntoNotes, a dataset covering a wider range of topics, the performance drops to 0.61 F1 on the development set. |
| Hardware Specification | Yes | Experiments were performed on an Intel i7-7700 machine with 31GB RAM running Linux. |
| Software Dependencies | Yes | The model is trained with Gini impurity as decision criterion in the implementation from scikit-learn (Pedregosa et al., 2011). For SUBGROUP-DISCOVERY, the PySubgroup library is used (Lemmerich & Becker, 2018)... For the LSTM+CNN+CRF classifier (Ma & Hovy, 2016) for NER, we follow the specific set-up from Hedderich et al. (2020) with English FastText embeddings. |
| Experiment Setup | Yes | Here, we use it to test our candidate patterns. Fisher's exact test allows to assess statistically whether two items co-occur independently based on contingency tables. ... In all experiments, we require p < 0.01. ... The fine-tuning data consists of 240 instances/sentences as two patterns did not match any training data. Fine-tuning on the additional data is performed for 30 epochs. ... We repeat all experiments 10 times and report the F1 score (the harmonic mean between precision and recall) as average across repetitions. |
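
The Experiment Setup row mentions Fisher's exact test on contingency tables with a threshold of p < 0.01. The following is a minimal sketch of how such a check could look for two items in a transaction database; it is not taken from the PREMISE code base. The function names, the 2x2 table layout, and the choice of a one-sided alternative (co-occurrence more frequent than expected under independence) are assumptions made for illustration.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration only):
      a = transactions containing both items X and Y
      b = transactions containing X but not Y
      c = transactions containing Y but not X
      d = transactions containing neither
    """
    with_x = a + b            # row total: transactions containing X
    with_y = a + c            # column total: transactions containing Y
    n = a + b + c + d         # total number of transactions
    total = comb(n, with_y)   # number of ways to place the Y-transactions
    # Sum hypergeometric probabilities of all tables with an overlap
    # at least as large as the observed one (margins held fixed).
    max_overlap = min(with_x, with_y)
    favourable = sum(
        comb(with_x, k) * comb(n - with_x, with_y - k)
        for k in range(a, max_overlap + 1)
    )
    return favourable / total


def co_occurs_significantly(a: int, b: int, c: int, d: int,
                            alpha: float = 0.01) -> bool:
    """Return True if the one-sided p-value falls below alpha (0.01 as quoted above)."""
    return fisher_exact_greater(a, b, c, d) < alpha
```

For example, `co_occurs_significantly(10, 0, 0, 10)` returns `True`, while a balanced table such as `co_occurs_significantly(1, 1, 1, 1)` returns `False`. Pure standard-library arithmetic is used here to keep the sketch dependency-free; `scipy.stats.fisher_exact` offers the same test with a choice of alternatives.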