A Neural Probabilistic Model for Context Based Citation Recommendation

Authors: Wenyi Huang, Zhaohui Wu, Chen Liang, Prasenjit Mitra, C. Lee Giles

AAAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement and evaluate our model on the entire CiteSeer dataset, which at the time of this work consists of 10,760,318 citation contexts from 1,017,457 papers. We show that the proposed model significantly outperforms other state-of-the-art models in recall, MAP, MRR, and nDCG.
Researcher Affiliation | Academia | Wenyi Huang, Zhaohui Wu, Chen Liang, Prasenjit Mitra, C. Lee Giles; Information Sciences and Technology, Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802
Pseudocode | No | The paper describes the model and training process using text and mathematical equations, and includes a neural network architecture diagram (Fig. 2), but no structured pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper states: 'The model is implemented in RefSeer (Huang et al. 2014), a citation recommendation engine, for public use. http://refseer.ist.psu.edu/'. This link points to a system that implements the model; it is not an explicit statement that the source code for the specific methodology described in this paper is open.
Open Datasets | Yes | A snapshot of the CiteSeer paper and citation database was obtained in Oct. 2013. The dataset is split into two parts: (1) papers crawled before 2011 (inclusive) as the training set and (2) papers crawled after 2011 as the testing set. Citations are extracted along with their citation contexts. One citation context consists of the sentence where a citation appears, as well as the sentences that appear before and after. As a result, the training set contains |C|=8,992,476 pairs of citation contexts and citations, and the testing set contains 1,628,698 pairs. The dataset is publicly available at http://refseer.ist.psu.edu/data/.
Dataset Splits | No | The paper specifies a train/test split with exact numbers ('training set contains |C|=8,992,476 pairs' and 'testing set contains 1,628,698 pairs'), but it does not mention a separate validation set or its specific split.
Hardware Specification | No | The paper does not specify any hardware details such as CPU/GPU models, memory, or specific cloud computing resources used for the experiments.
Software Dependencies | No | The paper mentions using the 'GIZA++ toolkit' for a baseline, but does not provide specific version numbers for any software dependencies related to the implementation of the proposed model.
Experiment Setup | Yes | The dimension of the word and document representation vectors is set to n = 600. For negative sampling, we set the number of negative samples to k = 10. For noise-contrastive estimation, we set the number of noise samples to k = 1000. For window size used in learning word representation, we follow the word2vec paper by fixing M = 5.
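The paper's evaluation uses ranking metrics (recall, MAP, MRR, nDCG). As a point of reference for one of them, the sketch below computes mean reciprocal rank over a list of ranked recommendations; the function name and the toy queries are illustrative assumptions, not taken from the paper.

```python
# Minimal MRR sketch: mean over queries of 1/rank of the first
# relevant item in each ranked recommendation list.
def mrr(ranked_lists, relevant_sets):
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Toy example: first query hits at rank 2 (0.5), second at rank 1 (1.0).
score = mrr([["a", "b", "c"], ["x", "y"]], [{"b"}, {"x"}])
# -> 0.75
```

Queries with no relevant item in the ranking contribute 0 to the mean, which matches the usual convention for MRR.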
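To make the reported hyperparameters concrete, here is a hedged sketch of the negative-sampling objective with the paper's settings (n = 600 dimensions, k = 10 negative samples). The vector initialization, scale, and function names are illustrative assumptions; this is not the paper's implementation.

```python
import numpy as np

# Paper's reported hyperparameters: n = 600, k = 10 negatives.
n, k = 600, 10
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(context_vec, target_vec, noise_vecs):
    """-[log sigma(c.t) + sum_j log sigma(-c.n_j)], the standard
    negative-sampling loss for one (context, target) pair."""
    pos = np.log(sigmoid(context_vec @ target_vec))
    neg = np.sum(np.log(sigmoid(-(noise_vecs @ context_vec))))
    return -(pos + neg)

# Illustrative small-scale random vectors standing in for learned embeddings.
context = rng.normal(scale=0.01, size=n)
target = rng.normal(scale=0.01, size=n)
noise = rng.normal(scale=0.01, size=(k, n))
loss = neg_sampling_loss(context, target, noise)
```

With k noise samples per positive pair, each update touches only k + 1 output vectors instead of the full vocabulary, which is what makes training on ~9M citation-context pairs tractable.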