The Surprising Computational Power of Nondeterministic Stack RNNs
Authors: Brian DuSell, David Chiang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We show empirically that the RNS-RNN can model some non-CFLs; in fact it is the only stack RNN able to learn {w#w \| w ∈ {0, 1}*}." "We demonstrate perplexity improvements with this new model on the Penn Treebank language modeling benchmark." "We show the cross-entropy difference on the validation and test sets in Fig. 2." "We evaluate models using cross-entropy difference as in Section 4." |
| Researcher Affiliation | Academia | "Brian DuSell and David Chiang, Department of Computer Science and Engineering, University of Notre Dame, {bdusell1,dchiang}@nd.edu" |
| Pseudocode | No | The paper describes algorithmic computations using mathematical equations and prose (e.g., "The algorithm uses a tensor of weights called the stack WFA..."), but it does not include a dedicated pseudocode block or algorithm listing. |
| Open Source Code | Yes | "Our code is publicly available." (footnote: https://github.com/bdusell/nondeterministic-stack-rnn) |
| Open Datasets | Yes | "We report perplexity on the Penn Treebank as preprocessed by Mikolov et al. (2011)." |
| Dataset Splits | Yes | "Before each training run, we sampled a training set of 10,000 examples and a validation set of 1,000 examples from p_L." "We used the standard train/validation/test splits for the Penn Treebank." |
| Hardware Specification | No | The paper mentions "the Center for Research Computing at the University of Notre Dame for providing the computing infrastructure for our experiments", but it does not provide specific hardware details such as GPU or CPU models. |
| Software Dependencies | No | The paper states, "Our code includes the original Docker image definition we used," but it does not explicitly list specific software dependencies with version numbers within the main text or appendices. |
| Experiment Setup | Yes | "We trained each model by minimizing its cross-entropy (summed over the timestep dimension of each batch) on the training set, and we used per-symbol cross-entropy on the validation set as the early stopping criterion. We optimized the parameters of the model with Adam. For each training run, we randomly sampled the initial learning rate from a log-uniform distribution over [5 × 10⁻⁴, 1 × 10⁻²], and we used a gradient clipping threshold of 5. We initialized all fully-connected layers except for those in the LSTM controller with Xavier uniform initialization, and all other parameters uniformly from [−0.1, 0.1]. We used mini-batches of size 10; each batch always contained examples of equal lengths. We randomly shuffled batches before each epoch. We multiplied the learning rate by 0.9 after 5 epochs of no improvement on the validation set, and we stopped early after 10 epochs of no improvement." |
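Two details of the quoted setup are easy to get wrong when reproducing it: the initial learning rate is sampled log-uniformly (not uniformly) over [5 × 10⁻⁴, 1 × 10⁻²], and gradients are clipped at a norm threshold of 5. The sketch below, in plain Python, illustrates both operations; the function names are illustrative, not from the paper's codebase, and the gradient values are made up for the example.

```python
import math
import random


def sample_log_uniform(low, high, rng=random):
    """Draw a value log-uniformly from [low, high].

    Equivalent to exponentiating a uniform draw in log-space, so each
    order of magnitude in [low, high] is sampled equally often.
    """
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def clip_gradient_norm(grads, threshold):
    """Rescale a gradient vector if its L2 norm exceeds the threshold.

    Mirrors standard norm-based gradient clipping: directions are
    preserved, only the overall magnitude is capped.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        grads = [g * scale for g in grads]
    return grads


# Initial learning rate, sampled once per training run.
lr = sample_log_uniform(5e-4, 1e-2)

# Hypothetical gradient with L2 norm 13 (> 5), so clipping rescales it.
grads = clip_gradient_norm([3.0, 4.0, 12.0], threshold=5.0)
```

In a framework like PyTorch, the same two steps would typically correspond to sampling `lr` once before constructing the Adam optimizer and calling a norm-clipping utility between the backward pass and the optimizer step.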