Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs
Authors: David I Inouye, Pradeep K Ravikumar, Inderjit S Dhillon
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide compelling quantitative and qualitative results on the BNC corpus that demonstrate the superiority of APM over previous topic models for identifying semantically meaningful word dependencies. We conducted two main timing experiments to show that the algorithm can be efficiently parallelized and can scale to reasonably large datasets. |
| Researcher Affiliation | Academia | David I. Inouye, Pradeep Ravikumar, Inderjit S. Dhillon; Department of Computer Science, University of Texas at Austin; {dinouye,pradeepr,inderjit}@cs.utexas.edu |
| Pseudocode | Yes | See Appendix D for Alg. 1 (Alternating Optimization for APM) and Alg. 2 (Approximate Newton Direction for Component Parameters). A minimal sketch of the alternating structure appears after this table. |
| Open Source Code | Yes | (MATLAB code available at: http://bigdata.ices.utexas.edu/software/apm/) |
| Open Datasets | Yes | Human-Scored Evocation Dataset: The original human-scored evocation dataset was produced by a set of trained undergraduates... From the possible pairwise evaluations, approximately 10% of the word pairs were randomly selected to be manually scored... The second dataset was constructed by predicting the pairs of words... scored using Amazon MTurk (mturk.com)... BNC Corpus: Because the evocation dataset was based on the BNC corpus, we used the BNC corpus for our experiments. |
| Dataset Splits | Yes | We randomly split the human scores into a 50% tuning split and a 50% testing split. Note that we have a tuning split rather than a training split because the model training algorithms are unsupervised (i.e. they never see the human scores), so the only supervision occurs in selecting the final model parameters (i.e. during the tuning phase). An illustrative split function follows the table. |
| Hardware Specification | Yes | All timing experiments were conducted on the TACC Maverick system with Intel Xeon E5-2680 v2 Ivy Bridge CPUs (2.80 GHz), 20 CPUs per node, and 12.8 GB memory per CPU (https://www.tacc.utexas.edu/). |
| Software Dependencies | No | The paper mentions 'MATLAB code' and use of the 'nltk package' and 'MALLET' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For our experiments as described in Sec. 4.1, we set α = 1 (i.e. a uniform prior on admixture weights) and β ∈ {0, 1}. The best performing LDA model was trained with 50 topics, α = 1 and β = 0.0001. The best APM model was the APM-Low Reg model trained with only 5 topics and a small regularization parameter λ = 0.05. These reported settings are collected in a short config sketch after the table. |
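The paper's pseudocode lives in Appendix D, and the authoritative implementation is the released MATLAB code. As a reading aid only, here is a minimal Python sketch of the alternating structure that the title of Alg. 1 implies; `fit_apm`, `update_weights`, `update_components`, and the placeholder update rules inside them are hypothetical stand-ins, not the paper's actual likelihood and approximate-Newton steps.

```python
import numpy as np

def update_weights(X, W, Phi):
    # Placeholder: renormalize onto the probability simplex.
    # The paper's real step maximizes the APM likelihood in W
    # with the component parameters Phi held fixed.
    W = np.maximum(W, 1e-8)
    return W / W.sum(axis=1, keepdims=True)

def update_components(X, W, Phi, lam):
    # Placeholder: soft-thresholding toward sparsity, standing in for
    # the approximate Newton direction of Alg. 2 with L1 weight lam.
    return np.sign(Phi) * np.maximum(np.abs(Phi) - lam, 0.0)

def fit_apm(X, k, n_iters=20, lam=0.05, seed=0):
    # X: (n_docs, n_words) count matrix; k: number of PMRF components.
    n_docs, n_words = X.shape
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(k), size=n_docs)                # admixture weights
    Phi = rng.normal(scale=0.01, size=(k, n_words, n_words))  # PMRF edge parameters
    for _ in range(n_iters):
        W = update_weights(X, W, Phi)            # fix Phi, update W
        Phi = update_components(X, W, Phi, lam)  # fix W, update Phi
    return W, Phi
```

The only point of the sketch is the block structure: each outer iteration freezes one block of parameters while updating the other, which is consistent with the paper's claim that the algorithm parallelizes efficiently across components.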
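The 50/50 tuning/testing protocol quoted in the Dataset Splits row is simple enough to illustrate directly. A minimal sketch, assuming each human score is stored as a (word_a, word_b, score) tuple; the function name `split_scores` and the fixed seed are ours, since the paper does not publish its splitting code or random seed.

```python
import numpy as np

def split_scores(pairs, seed=0):
    # Shuffle the human-scored word pairs, then cut them in half:
    # 50% for tuning (model selection) and 50% for testing.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    half = len(idx) // 2
    tuning = [pairs[i] for i in idx[:half]]
    testing = [pairs[i] for i in idx[half:]]
    return tuning, testing

# Example with made-up scores (not from the evocation dataset):
scores = [("coffee", "cup", 0.8), ("coffee", "tree", 0.1),
          ("dog", "bark", 0.9), ("cat", "moon", 0.0)]
tuning, testing = split_scores(scores)
```

Because training is unsupervised, the tuning half is used only to pick which trained model to report, and the testing half is touched once for the final evaluation.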
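Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a small config for reference. The dictionary and key names are ours; only the values themselves come from the paper.

```python
# Best-performing settings as reported in Sec. 4.1 (key names are ours).
lda_best = {"n_topics": 50, "alpha": 1.0, "beta": 1e-4}
apm_best = {"n_topics": 5, "alpha": 1.0, "lambda": 0.05}  # "APM-Low Reg"
```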