A Linear Dynamical System Model for Text

Authors: David Belanger, Sham Kakade

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we employ our inferred word embeddings as features in standard tagging tasks, obtaining significant accuracy improvements. |
| Researcher Affiliation | Collaboration | David Belanger (belanger@cs.umass.edu), College of Information and Computer Sciences, University of Massachusetts Amherst; Sham Kakade (skakade@microsoft.com), Microsoft Research |
| Pseudocode | Yes | Algorithm 1: Learning an LDS for Text (a hedged sketch of this SSID-style estimator follows the table) |
| Open Source Code | No | We will release the code of our implementation. SSID requires simple scripting on top of a sparse linear algebra library. Our EM implementation consists of small modifications to Martens' public ASOS code. |
| Open Datasets | Yes | We fit our LDS using a combination of the APNews, New York Times, and RCV1 newswire corpora, about 1B tokens total. [...] We train the tagging model on the Penn Treebank (PTB) train set, which is not included for LDS training. |
| Dataset Splits | Yes | The LDS hyperparameters were selected by maximizing the accuracy of a local classifier on the PTB dev set. |
| Hardware Specification | Yes | Overall, we found that the LDS and Word2Vec took about 12 hours to train on a single-core CPU. [...] The time to train the LDS, about 30 minutes, is inconsequential compared to training the RNN (4 days) on a single CPU core. |
| Software Dependencies | No | The paper mentions using a "sparse linear algebra library" and Martens' public ASOS code but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We employ r = 4 for SSID, r = 7 for EM, and h = 200. We add 1000 pseudocounts for each type, by adding 1000/T to each coordinate of µ. [...] Our local classifier was a two-layer neural network with 25 hidden units, which outperformed a linear classifier. The best Word2Vec configuration used the CBOW architecture with a window width of 3. [...] This initializes parameters randomly, with lengthscales tuned as in Mikolov (2012). [...] We tuned the initial value and decay rate. (see the smoothing and Word2Vec configuration sketches below) |
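
The paper's Algorithm 1 learns the LDS spectrally, and the authors describe SSID as "simple scripting on top of a sparse linear algebra library": with one-hot word observations, the lagged cross-covariances used by subspace identification reduce to (scaled) sparse co-occurrence count matrices. The following is a minimal sketch of that style of estimator under stated assumptions, not the authors' released code; the function names, the single-block-column Hankel matrix, and the omission of mean-centering and covariance scaling are simplifications introduced here.

```python
# Hedged sketch of SVD-based subspace identification (SSID) for a text LDS.
# Assumes one-hot word observations, so Cov(x_{t+i}, x_t) is approximated by
# a normalized sparse co-occurrence count matrix (centering omitted here).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def lagged_cooccurrence(token_ids, vocab_size, lag):
    """Sparse V x V matrix: entry (i, j) counts word i occurring `lag` steps after j."""
    rows, cols = token_ids[lag:], token_ids[:-lag]
    counts = sp.coo_matrix((np.ones(len(rows)), (rows, cols)),
                           shape=(vocab_size, vocab_size)).tocsr()
    return counts / len(rows)

def ssid_embeddings(token_ids, vocab_size, r=4, h=200):
    """Estimate the LDS observation matrix C; its rows act as word embeddings.

    r = 4 and h = 200 mirror the hyperparameters quoted in the table above.
    """
    # Stack lag-1..lag-r statistics into a tall (rV x V) Hankel-style matrix.
    hankel = sp.vstack([lagged_cooccurrence(token_ids, vocab_size, i)
                        for i in range(1, r + 1)])
    # Rank-h truncated SVD; the top V rows of U * sqrt(S) recover C up to a
    # similarity transform, as in Ho-Kalman-style spectral methods.
    U, s, _ = svds(hankel, k=h)
    return U[:vocab_size, :] * np.sqrt(s)
```

Everything here stays sparse until the truncated SVD, which is consistent with the quoted remark that LDS training is inexpensive (about 30 minutes) relative to the RNN baseline.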
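The pseudocount line in the Experiment Setup row reads naturally as additive smoothing of the empirical unigram mean: with T training tokens, 1000 pseudocounts per word type adds 1000/T to every coordinate of µ. A one-line sketch of that reading (the function name is an assumption, and whether the authors renormalize afterwards is not stated in the excerpt):

```python
import numpy as np

def smoothed_unigram_mean(counts: np.ndarray, pseudocount: float = 1000.0) -> np.ndarray:
    """Empirical mean of one-hot observations with `pseudocount` added per type."""
    T = counts.sum()                   # total number of training tokens
    return (counts + pseudocount) / T  # equivalently: mu + 1000/T per coordinate
```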
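The best Word2Vec baseline quoted above used the CBOW architecture with a window width of 3. Below is a hedged reconstruction of that configuration using the gensim library; the authors' actual tool is not named in this excerpt, and the vector size and min_count values are assumptions.

```python
from gensim.models import Word2Vec

# Toy placeholder corpus; the paper's baseline was trained on ~1B newswire tokens.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]

model = Word2Vec(
    sentences=corpus,
    sg=0,             # 0 selects the CBOW architecture, as quoted
    window=3,         # window width of 3, as quoted
    vector_size=200,  # assumption: matched to the LDS state size h = 200
    min_count=1,      # assumption; set low only so the toy corpus is usable
)
word_vectors = model.wv  # embeddings used as features in the tagging tasks
```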