Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Re-TACRED: Addressing Shortcomings of the TACRED Dataset
Authors: George Stoica, Emmanouil Antonios Platanios, Barnabas Poczos (pp. 13843-13850)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After verification, we observed that 23.9% of TACRED labels are incorrect. Moreover, evaluating several models on our revised dataset yields an average f1-score improvement of 14.3% and helps uncover significant relationships between the different models (rather than simply offsetting or scaling their scores by a constant factor). |
| Researcher Affiliation | Collaboration | George Stoica¹, Emmanouil Antonios Platanios², Barnabas Poczos¹; ¹Carnegie Mellon University, ²Microsoft Semantic Machines |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | No | We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED). Due to licensing restrictions, we cannot release complete dataset, but similar to Alt, Gabryszak, and Hennig (2020), we release a patch that contains all of our revisions. We term the corrected dataset Revised-TACRED (Re-TACRED). |
| Open Datasets | Yes | Finally, aside from our analysis we also release Re-TACRED, a new completely re-annotated version of the TACRED dataset that can be used to perform reliable evaluation of relation extraction models. ... We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED). |
| Dataset Splits | Yes | TACRED consists of over 106,000 sentences collected from the 2009-2014 TAC knowledge base population (KBP) evaluations, with those between 2009-2012 used for training, 2013 for development, and 2014 for testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly mentioned. |
| Experiment Setup | Yes | All results were reported using micro-averaged f1-scores from the model with the median validation f1-score over five independent runs, as in prior literature. |
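
The reporting protocol quoted in the Experiment Setup row (micro-averaged F1, taken from the run with the median validation F1 over five independent runs) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `dev_f1` and `seed` keys, and the choice to exclude a `no_relation` label from the micro-average are assumptions made for illustration.

```python
def micro_f1(gold, pred, negative_label="no_relation"):
    """Micro-averaged F1 over all labels except the negative label.

    Excluding `negative_label` is an assumption of this sketch,
    not something stated in the quoted text.
    """
    correct = sum(1 for g, p in zip(gold, pred) if g == p and p != negative_label)
    predicted = sum(1 for p in pred if p != negative_label)
    actual = sum(1 for g in gold if g != negative_label)
    if correct == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = correct / predicted
    recall = correct / actual
    return 2 * precision * recall / (precision + recall)


def select_median_run(runs):
    """Return the run whose validation F1 is the median across runs.

    `runs` is a list of dicts with a hypothetical "dev_f1" key.
    With an even number of runs this picks the upper median.
    """
    ranked = sorted(runs, key=lambda r: r["dev_f1"])
    return ranked[len(ranked) // 2]
```

With five runs, `select_median_run` returns the third-ranked run by validation score, and that run's test-set `micro_f1` would be the number reported.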