EntQA: Entity Linking as Question Answering
Authors: Wenzheng Zhang, Wenyue Hua, Karl Stratos
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | EntQA achieves strong results on the GERBIL benchmarking platform. We analyze EntQA and find that its retrieval performance is extremely strong (over 98% top-100 recall on the validation set of AIDA), verifying our hypothesis that finding relevant entities without knowing their mentions is easy. We also find that the reader makes reasonable errors such as accurately predicting missing hyperlinks or linking a mention to a correct entity that is more specific than the gold label. |
| Researcher Affiliation | Academia | Wenzheng Zhang, Wenyue Hua, Karl Stratos Department of Computer Science Rutgers University {wenzheng.zhang,wenyue.hua,karl.stratos}@rutgers.edu |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. The methods are described through prose and mathematical equations. |
| Open Source Code | Yes | Code available at: https://github.com/WenzhengZhang/EntQA |
| Open Datasets | Yes | We follow the established practice and report the InKB Micro F1 score on the in-domain and out-of-domain datasets used in De Cao et al. (2021). Specifically, we use the AIDA-CoNLL dataset (Hoffart et al., 2011) as the in-domain dataset... For the KB, we use the 2019 Wikipedia dump provided in the KILT benchmark (Petroni et al., 2021), which contains 5.9 million entities. |
| Dataset Splits | Yes | Specifically, we use the AIDA-CoNLL dataset (Hoffart et al., 2011) as the in-domain dataset: we train EntQA on the training portion of AIDA, use the validation portion (AIDA-A) for development, and reserve the test portion (AIDA-B) for in-domain test performance. |
| Hardware Specification | Yes | The retriever is trained on 4 GPUs (A100) for 9 hours; the reader is trained on 2 GPUs for 6 hours. |
| Software Dependencies | No | The paper mentions several software components, models, and frameworks (e.g., BLINK, ELECTRA-large, SQuAD 2.0, Faiss, Adam, BERT, BART), but it does not specify exact version numbers for any of them, which limits reproducibility. |
| Experiment Setup | Yes | We break up each document x ∈ X into overlapping passages of length L = 32 with stride S = 16 under WordPiece tokenization... We use 64 candidate entities in training for both the retriever and the reader; we use 100 candidates at test time. We predict up to P = 3 mention spans for each candidate entity. We use γ = 0.05 as the threshold... For optimization, we use Adam (Kingma & Ba, 2015) with learning rate 2e-6 for the retriever and 1e-5 for the reader; we use a linear learning rate decay schedule with warmup proportion 0.06 over 4 epochs for both modules. The batch size is 4 for the retriever and 2 for the reader. |
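
The Experiment Setup row quotes the paper's passage-chunking step: each document is split into overlapping passages of L = 32 WordPiece tokens with stride S = 16. Below is a minimal sketch of that step, assuming a standard HuggingFace WordPiece tokenizer; the function name `chunk_passages` and the `bert-base-uncased` tokenizer choice are illustrative assumptions, not details taken from the EntQA codebase.

```python
# Illustrative sketch (not the EntQA implementation): split a document into
# overlapping passages of `length` WordPiece tokens with a fixed `stride`.
from transformers import BertTokenizerFast

# Any WordPiece tokenizer works here; bert-base-uncased is an assumption.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def chunk_passages(document: str, length: int = 32, stride: int = 16):
    """Return overlapping passages of `length` WordPiece tokens with `stride`."""
    tokens = tokenizer.tokenize(document)
    passages = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + length]
        passages.append(tokenizer.convert_tokens_to_string(window))
        # Stop once the current window already reaches the end of the document.
        if start + length >= len(tokens):
            break
    return passages
```

For example, a 48-token document yields passages starting at token offsets 0 and 16, so the middle tokens are covered by two overlapping windows; this matches the L = 32, S = 16 setting quoted above.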