EntQA: Entity Linking as Question Answering

Authors: Wenzheng Zhang, Wenyue Hua, Karl Stratos

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | EntQA achieves strong results on the GERBIL benchmarking platform. We analyze EntQA and find that its retrieval performance is extremely strong (over 98% top-100 recall on the validation set of AIDA), verifying our hypothesis that finding relevant entities without knowing their mentions is easy. We also find that the reader makes reasonable errors such as accurately predicting missing hyperlinks or linking a mention to a correct entity that is more specific than the gold label.
Researcher Affiliation | Academia | Wenzheng Zhang, Wenyue Hua, Karl Stratos, Department of Computer Science, Rutgers University, {wenzheng.zhang,wenyue.hua,karl.stratos}@rutgers.edu
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. The methods are described through prose and mathematical equations.
Open Source Code | Yes | Code available at: https://github.com/WenzhengZhang/EntQA
Open Datasets | Yes | We follow the established practice and report the InKB Micro F1 score on the in-domain and out-of-domain datasets used in De Cao et al. (2021). Specifically, we use the AIDA-CoNLL dataset (Hoffart et al., 2011) as the in-domain dataset... For the KB, we use the 2019 Wikipedia dump provided in the KILT benchmark (Petroni et al., 2021), which contains 5.9 million entities.
Dataset Splits | Yes | Specifically, we use the AIDA-CoNLL dataset (Hoffart et al., 2011) as the in-domain dataset: we train EntQA on the training portion of AIDA, use the validation portion (AIDA-A) for development, and reserve the test portion (AIDA-B) for in-domain test performance.
Hardware Specification | Yes | The retriever is trained on 4 GPUs (A100) for 9 hours; the reader is trained on 2 GPUs for 6 hours.
Software Dependencies | No | The paper mentions several software components, models, and frameworks (e.g., BLINK, ELECTRA-large, SQuAD 2.0, Faiss, Adam, BERT, BART), but it does not specify exact version numbers for any of these to ensure reproducibility.
Experiment Setup | Yes | We break up each document x ∈ X into overlapping passages of length L = 32 with stride S = 16 under WordPiece tokenization... We use 64 candidate entities in training for both the retriever and the reader; we use 100 candidates at test time. We predict up to P = 3 mention spans for each candidate entity. We use γ = 0.05 as the threshold... For optimization, we use Adam (Kingma & Ba, 2015) with learning rate 2e-6 for the retriever and 1e-5 for the reader; we use a linear learning rate decay schedule with warmup proportion 0.06 over 4 epochs for both modules. The batch size is 4 for the retriever and 2 for the reader.
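
The overlapping-passage chunking quoted above (length L = 32, stride S = 16 under WordPiece tokenization) can be illustrated with a minimal sketch. This is an assumption-laden illustration using the Hugging Face tokenizers library with a generic BERT WordPiece tokenizer; it is not the authors' released preprocessing code, and the function name and model choice below are hypothetical.

```python
# Minimal sketch of overlapping-passage chunking (L = 32, S = 16 under
# WordPiece tokenization). Illustrative only; the released EntQA code may
# handle special tokens, titles, and boundaries differently.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # any WordPiece tokenizer

def split_into_passages(document: str, length: int = 32, stride: int = 16):
    """Break a document into overlapping WordPiece passages of `length` tokens,
    starting a new passage every `stride` tokens."""
    tokens = tokenizer.tokenize(document)  # WordPiece tokens, no special tokens
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(tokens[start:start + length])
        if start + length >= len(tokens):  # last window already covers the tail
            break
    return passages

# Example: consecutive passages share a 16-token overlap.
doc = "England won the World Cup in 1966, beating West Germany at Wembley."
for i, passage in enumerate(split_into_passages(doc)):
    print(i, tokenizer.convert_tokens_to_string(passage))
```

With these defaults (L = 2S), each token falls into at most two passages, so mentions split by one window boundary are still seen whole in the neighboring passage.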