A Linear Dynamical System Model for Text
Authors: David Belanger, Sham Kakade
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we employ our inferred word embeddings as features in standard tagging tasks, obtaining significant accuracy improvements. |
| Researcher Affiliation | Collaboration | David Belanger (belanger@cs.umass.edu), College of Information and Computer Sciences, University of Massachusetts Amherst; Sham Kakade (skakade@microsoft.com), Microsoft Research |
| Pseudocode | Yes | Algorithm 1 Learning an LDS for Text (a hedged sketch of such a pipeline appears below the table) |
| Open Source Code | No | We will release the code of our implementation. SSID requires simple scripting on top of a sparse linear algebra library. Our EM implementation consists of small modifications to Martens' public ASOS code. |
| Open Datasets | Yes | We fit our LDS using a combination of the APNews, New York Times, and RCV1 newswire corpora, about 1B tokens total. [...] We train the tagging model on the Penn Treebank (PTB) train set, which is not included for LDS training. |
| Dataset Splits | Yes | The LDS hyperparameters were selected by maximizing the accuracy of a local classifier on the PTB dev set. |
| Hardware Specification | Yes | Overall, we found that the LDS and Word2Vec took about 12 hours to train on a single-core CPU. [...] The time to train the LDS, about 30 minutes, is inconsequential compared to training the RNN (4 days) on a single CPU core. |
| Software Dependencies | No | The paper mentions using a "sparse linear algebra library" and "Martens' public ASOS code" but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We employ r = 4 for SSID, r = 7 for EM, and h = 200. We add 1000 pseudocounts for each type, by adding 1000/T to each coordinate of µ. [...] Our local classifier was a two-layer neural network with 25 hidden units, which outperformed a linear classifier. The best Word2Vec configuration used the CBOW architecture with a window width of 3. [...] This initializes parameters randomly, with lengthscales tuned as in Mikolov (2012). [...] We tuned the initial value and decay rate. (The pseudocount arithmetic is restated in the second sketch below the table.) |
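For orientation, here is a minimal, non-authoritative sketch of the kind of spectral (SSID-style) LDS estimation the Pseudocode row refers to. It is not the paper's Algorithm 1: it simplifies to a single-step bigram moment rather than stacking r = 4 past/future steps, and `k` below plays the role of the paper's state dimension h = 200. All function and variable names are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def ssid_text_lds(tokens, vocab_size, k=200):
    """Illustrative spectral estimation of an LDS over one-hot words.

    Builds the empirical bigram second moment E[x_{t+1} x_t^T], takes a
    rank-k SVD to obtain an observation map C (one k-dim embedding per
    word type), then recovers a transition matrix A by least squares on
    the projected states. A simplified sketch, not the paper's method.
    """
    tokens = np.asarray(tokens)
    n = len(tokens) - 1
    weights = np.full(n, 1.0 / n)
    # sparse V x V bigram moment: rows index the next word, cols the previous;
    # duplicate (row, col) pairs are summed by the csr_matrix constructor
    M = csr_matrix((weights, (tokens[1:], tokens[:-1])),
                   shape=(vocab_size, vocab_size))
    # rank-k SVD; the left singular vectors serve as the observation map C
    U, s, Vt = svds(M, k=k)
    C = U
    # project one-hot observations to latent states: h_t = C^T x_t = C[w_t]
    H = C[tokens]
    # least-squares transition: find A with h_{t+1} ~= A h_t
    X, *_ = np.linalg.lstsq(H[:-1], H[1:], rcond=None)
    return C, X.T
```

The paper's full pipeline also estimates noise terms and can refine the model with EM, both omitted in this sketch; here `C[w]` simply serves as a static k-dimensional embedding for word `w`.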
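The pseudocount adjustment quoted in the Experiment Setup row is plain arithmetic; the sketch below restates it literally (add 1000/T to each coordinate of the unigram vector µ, with T the total token count). The function name is ours.

```python
import numpy as np

def smooth_unigram(counts, pseudocount=1000.0):
    """Add `pseudocount` pseudocounts per word type, as quoted:
    mu = counts / T, then add pseudocount / T to each coordinate,
    where T is the total token count."""
    counts = np.asarray(counts, dtype=float)
    T = counts.sum()
    mu = counts / T
    return mu + pseudocount / T
```

Note the adjusted vector no longer sums to one; renormalizing it would recover standard add-k smoothing.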