Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Label-Descriptive Patterns and Their Application to Characterizing Classification Errors

Authors: Michael A. Hedderich, Jonas Fischer, Dietrich Klakow, Jilles Vreeken

ICML 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through an extensive set of experiments we show it performs very well in practice on both synthetic and real-world data.
Researcher Affiliation	Academia	1Saarland University, Saarland Informatics Campus, Saarbrücken, Germany. 2Max Planck Institute for Informatics, Saarbrücken, Germany. 3CISPA Helmholtz Center for Information Security, Saarbrücken, Germany.
Pseudocode	Yes	Algorithm 1 PREMISE ... Algorithm 2 create Candidates
Open Source Code	Yes	datasets and code available online. https://github.com/uds-lsv/premise
Open Datasets	Yes	We analyze the misclassification of Visual7W (Zhu et al., 2016) and the state-of-the-art LXMERT (Tan & Bansal, 2019)... The classifier is trained on the standard NER dataset CoNLL03... On OntoNotes, a dataset covering a wider range of topics... we derive transactions/instances from the around 3.4k sentences in the development set of the Penn Treebank Corpus.
Dataset Splits	Yes	We derive misclassification data sets from applying the classifiers to the development sets. ... For LXMERT, the minival version of the development set is used. ... On OntoNotes, a dataset covering a wider range of topics, the performance drops to 0.61 F1 on the development set.
Hardware Specification	Yes	Experiments were performed on an Intel i7-7700 machine with 31GB RAM running Linux.
Software Dependencies	Yes	The model is trained with Gini impurity as decision criterion in the implementation from scikit-learn (Pedregosa et al., 2011). For SUBGROUP-DISCOVERY, the PySubgroup library is used (Lemmerich & Becker, 2018)... For the LSTM+CNN+CRF classifier (Ma & Hovy, 2016) for NER, we follow the specific set-up from Hedderich et al. (2020) with English FastText embeddings.
Experiment Setup	Yes	Here, we use it to test our candidate patterns. Fisher's exact test allows to assess statistically whether two items co-occur independently based on contingency tables. ... In all experiments, we require p < 0.01. ... The fine-tuning data consists of 240 instances/sentences as two patterns did not match any training data. Fine-tuning on the additional data is performed for 30 epochs. ... We repeat all experiments 10 times and report the F1 score, the harmonic mean between precision and recall, as average across repetitions.
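
The Experiment Setup row mentions Fisher's exact test on contingency tables with a threshold of p < 0.01. The following is a minimal sketch of that kind of check, not the authors' implementation: the function names, the 2x2 table layout, and the choice of a one-sided (over-representation) test are all assumptions made for illustration. Only the 0.01 threshold comes from the quoted text.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration only):
      a = transactions containing both items
      b = transactions containing only the first item
      c = transactions containing only the second item
      d = transactions containing neither item
    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the probability of seeing at least this many co-occurrences
    if the two items were independent.
    """
    row1 = a + b          # transactions with the first item
    col1 = a + c          # transactions with the second item
    n = a + b + c + d     # all transactions
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, col1)


def co_occurrence_is_significant(a: int, b: int, c: int, d: int,
                                 alpha: float = 0.01) -> bool:
    """Apply the p < 0.01 requirement quoted in the table (alpha is the threshold)."""
    return fisher_exact_greater(a, b, c, d) < alpha


if __name__ == "__main__":
    # Two items that always appear together in 10 of 20 transactions.
    print(co_occurrence_is_significant(10, 0, 0, 10))
    # A perfectly balanced table gives no evidence of dependence.
    print(co_occurrence_is_significant(1, 1, 1, 1))
```

A two-sided variant, or a library routine such as SciPy's `scipy.stats.fisher_exact`, could be substituted; the source does not state which alternative hypothesis was used.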