Re-TACRED: Addressing Shortcomings of the TACRED Dataset
Authors: George Stoica, Emmanouil Antonios Platanios, Barnabas Poczos
AAAI 2021, pp. 13843-13850
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After verification, we observed that 23.9% of TACRED labels are incorrect. Moreover, evaluating several models on our revised dataset yields an average f1-score improvement of 14.3% and helps uncover significant relationships between the different models (rather than simply offsetting or scaling their scores by a constant factor). |
| Researcher Affiliation | Collaboration | George Stoica (1), Emmanouil Antonios Platanios (2), Barnabas Poczos (1); (1) Carnegie Mellon University, (2) Microsoft Semantic Machines; gis@cs.cmu.edu, emplata@microsoft.com, bapoczos@cs.cmu.edu |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | No | We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED). Due to licensing restrictions, we cannot release the complete dataset, but similar to Alt, Gabryszak, and Hennig (2020), we release a patch that contains all of our revisions. We term the corrected dataset Revised-TACRED (Re-TACRED). (See the patch-application sketch below the table.) |
| Open Datasets | Yes | Finally, aside from our analysis we also release Re-TACRED, a new completely re-annotated version of the TACRED dataset that can be used to perform reliable evaluation of relation extraction models. ... We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED). |
| Dataset Splits | Yes | TACRED consists of over 106,000 sentences collected from the 2009-2014 TAC knowledge base population (KBP) evaluations, with those between 2009-2012 used for training, 2013 for development, and 2014 for testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly mentioned. |
| Experiment Setup | Yes | All results were reported using micro-averaged f1-scores from the model with the median validation f1-score over five independent runs, as in prior literature. (See the evaluation-protocol sketch below the table.) |
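
Because only a label patch is distributed, reconstructing Re-TACRED requires merging the patch into a licensed copy of TACRED. Below is a minimal Python sketch of that merge, assuming the patch is a JSON object mapping TACRED example ids to corrected relation labels and that ids absent from the patch were dropped during re-annotation. The file names and patch format here are assumptions, not confirmed by the report; consult https://github.com/gstoica27/Re-TACRED for the actual layout.

```python
import json

# Minimal sketch of merging the Re-TACRED label patch into a licensed
# TACRED copy. File names and the id -> label patch format are assumptions;
# see https://github.com/gstoica27/Re-TACRED for the actual layout.
def apply_patch(tacred_split_path: str, patch_path: str, out_path: str) -> None:
    with open(tacred_split_path) as f:
        examples = json.load(f)    # original LDC-licensed TACRED examples
    with open(patch_path) as f:
        id2label = json.load(f)    # assumed: {example_id: corrected_relation}

    patched = []
    for ex in examples:
        # Assumption: ids missing from the patch were pruned during re-annotation.
        if ex["id"] in id2label:
            ex["relation"] = id2label[ex["id"]]
            patched.append(ex)

    with open(out_path, "w") as f:
        json.dump(patched, f)

apply_patch("tacred/train.json", "patch/train_id2label.json", "re-tacred/train.json")
```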
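
The quoted reporting protocol (micro-averaged F1 from the run whose validation F1 is the median of five independent runs) can be made concrete with a short sketch. Ignoring the `no_relation` class follows the standard TACRED scoring convention; the function and dictionary keys below are illustrative, not the authors' code.

```python
import statistics

def micro_f1(preds, golds, negative="no_relation"):
    """Micro-averaged F1 that ignores the negative class, following the
    standard TACRED scoring convention."""
    tp = sum(p == g != negative for p, g in zip(preds, golds))
    pred_pos = sum(p != negative for p in preds)
    gold_pos = sum(g != negative for g in golds)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def report_median_run(runs):
    """runs: five dicts with (illustrative) keys 'dev_f1' and 'test_f1'.
    Returns the test F1 of the run whose validation F1 is the median,
    matching the reporting protocol quoted in the table."""
    median_dev = statistics.median(r["dev_f1"] for r in runs)
    return next(r["test_f1"] for r in runs if r["dev_f1"] == median_dev)
```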