EntQA: Entity Linking as Question Answering

Authors: Wenzheng Zhang, Wenyue Hua, Karl Stratos

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | EntQA achieves strong results on the GERBIL benchmarking platform. We analyze EntQA and find that its retrieval performance is extremely strong (over 98% top-100 recall on the validation set of AIDA), verifying our hypothesis that finding relevant entities without knowing their mentions is easy. We also find that the reader makes reasonable errors such as accurately predicting missing hyperlinks or linking a mention to a correct entity that is more specific than the gold label.
Researcher Affiliation | Academia | Wenzheng Zhang, Wenyue Hua, Karl Stratos, Department of Computer Science, Rutgers University, {wenzheng.zhang,wenyue.hua,karl.stratos}@rutgers.edu
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. The methods are described through prose and mathematical equations.
Open Source Code | Yes | Code available at: https://github.com/WenzhengZhang/EntQA
Open Datasets | Yes | We follow the established practice and report the InKB Micro F1 score on the in-domain and out-of-domain datasets used in De Cao et al. (2021). Specifically, we use the AIDA-CoNLL dataset (Hoffart et al., 2011) as the in-domain dataset... For the KB, we use the 2019 Wikipedia dump provided in the KILT benchmark (Petroni et al., 2021), which contains 5.9 million entities.
Dataset Splits | Yes | Specifically, we use the AIDA-CoNLL dataset (Hoffart et al., 2011) as the in-domain dataset: we train EntQA on the training portion of AIDA, use the validation portion (AIDA-A) for development, and reserve the test portion (AIDA-B) for in-domain test performance.
Hardware Specification | Yes | The retriever is trained on 4 GPUs (A100) for 9 hours; the reader is trained on 2 GPUs for 6 hours.
Software Dependencies | No | The paper mentions several software components, models, and frameworks (e.g., BLINK, ELECTRA-large, SQuAD 2.0, Faiss, Adam, BERT, BART), but it does not specify exact version numbers for any of these to ensure reproducibility.
Experiment Setup | Yes | We break up each document x ∈ X into overlapping passages of length L = 32 with stride S = 16 under WordPiece tokenization... We use 64 candidate entities in training for both the retriever and the reader; we use 100 candidates at test time. We predict up to P = 3 mention spans for each candidate entity. We use γ = 0.05 as the threshold... For optimization, we use Adam (Kingma & Ba, 2015) with learning rate 2e-6 for the retriever and 1e-5 for the reader; we use a linear learning rate decay schedule with warmup proportion 0.06 over 4 epochs for both modules. The batch size is 4 for the retriever and 2 for the reader.
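
The overlapping-passage chunking quoted above (length L = 32, stride S = 16 under WordPiece tokenization) can be illustrated with a minimal sketch. This is an assumption-laden illustration using the Hugging Face tokenizers library with a generic BERT WordPiece tokenizer; it is not the authors' released preprocessing code, and the function name and model choice below are hypothetical.

```python
# Minimal sketch of overlapping-passage chunking (L = 32, S = 16 under
# WordPiece tokenization). Illustrative only; the released EntQA code may
# handle special tokens, titles, and boundaries differently.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # any WordPiece tokenizer

def split_into_passages(document: str, length: int = 32, stride: int = 16):
    """Break a document into overlapping WordPiece passages of `length` tokens,
    starting a new passage every `stride` tokens."""
    tokens = tokenizer.tokenize(document)  # WordPiece tokens, no special tokens
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(tokens[start:start + length])
        if start + length >= len(tokens):  # last window already covers the tail
            break
    return passages

# Example: consecutive passages share a 16-token overlap.
doc = "England won the World Cup in 1966, beating West Germany at Wembley."
for i, passage in enumerate(split_into_passages(doc)):
    print(i, tokenizer.convert_tokens_to_string(passage))
```

With these defaults (L = 2S), each token falls into at most two passages, so mentions split by one window boundary are still seen whole in the neighboring passage.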