A Tensor Decomposition Perspective on Second-order RNNs
Authors: Maude Lizaire, Michael Rizvi-Martel, Marawan Gamal, Guillaume Rabusseau
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support these results empirically with experiments on the Penn Treebank dataset which demonstrate that, with a fixed parameter budget, CPRNNs outperforms RNNs, 2RNNs, and MIRNNs with the right choice of rank and hidden size. |
| Researcher Affiliation | Academia | ¹Mila & DIRO, Université de Montréal, Montreal, Canada. ²CIFAR AI Chair. Correspondence to: Maude Lizaire <maude.lizaire@umontreal.ca>, Guillaume Rabusseau <grabus@iro.umontreal.ca>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code base for this paper can be found at https://github.com/MaudeLiz/cprnn |
| Open Datasets | Yes | We perform experiments on the Penn Treebank dataset (Marcus et al., 1993) measuring bits-per-character (BPC) using the same train/valid/test partition as in Mikolov et al. (2012). |
| Dataset Splits | Yes | We perform experiments on the Penn Treebank dataset (Marcus et al., 1993) measuring bits-per-character (BPC) using the same train/valid/test partition as in Mikolov et al. (2012). |
| Hardware Specification | No | The paper mentions 'material support from NVIDIA Corporation in the form of computational resources' but does not specify the exact hardware (e.g., specific GPU or CPU models, memory details) used for the experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and tanh activation function, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | All models were trained using truncated backpropagation through time (Werbos, 1990) with sequence length of 50, batch size of 128 and using the Adam optimizer (Kingma and Ba, 2015) to minimize the negative log likelihood. Initial weights were drawn from a uniform random distribution U[-1/n, 1/n]. For all experiments, we use the tanh activation function. For training, we use early stopping and a scheduler to reduce the learning rate (initialized at 0.001) by half on plateaus of the validation loss. |
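
To make the quoted experiment setup concrete, below is a minimal sketch of a training loop with those hyperparameters, assuming a PyTorch implementation. It is not the authors' code: the `CharRNN` module, the random placeholder corpus, and the exact `U[-1/n, 1/n]` initialization bound are illustrative assumptions, and a plain tanh RNN stands in for the paper's CPRNN.

```python
# Hedged sketch of the quoted training setup: TBPTT window 50, batch size 128,
# Adam (lr 0.001), NLL objective, tanh activation, ReduceLROnPlateau with factor 0.5.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder data: random character ids standing in for the integer-encoded
# Penn Treebank character stream (the real corpus is loaded from disk).
vocab_size, hidden_size = 50, 256
seq_len, batch_size = 50, 128            # TBPTT window and batch size from the paper
corpus = torch.randint(vocab_size, (batch_size, 10 * seq_len + 1))

class CharRNN(nn.Module):
    """Plain tanh RNN character model (stand-in for the CPRNN)."""
    def __init__(self, vocab, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.RNN(hidden, hidden, nonlinearity="tanh", batch_first=True)
        self.out = nn.Linear(hidden, vocab)
        # Uniform init U[-1/n, 1/n]; the paper's exact scaling may differ.
        bound = 1.0 / hidden
        for p in self.parameters():
            nn.init.uniform_(p, -bound, bound)

    def forward(self, x, h=None):
        y, h = self.rnn(self.embed(x), h)
        return self.out(y), h

model = CharRNN(vocab_size, hidden_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5)
criterion = nn.CrossEntropyLoss()        # negative log-likelihood over characters

for epoch in range(2):                   # early stopping on validation BPC omitted here
    hidden = None
    for t in range(0, corpus.size(1) - 1, seq_len):
        x = corpus[:, t:t + seq_len]
        y = corpus[:, t + 1:t + 1 + seq_len]
        logits, hidden = model(x, hidden)
        hidden = hidden.detach()         # truncate backprop at the window boundary
        loss = criterion(logits.reshape(-1, vocab_size), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Stand-in for validation BPC (real code evaluates on the validation split):
    # bits-per-character = cross-entropy in nats / ln 2.
    bpc = loss.item() / math.log(2)
    scheduler.step(bpc)                  # halve the learning rate on plateaus
```

The sketch mirrors only the hyperparameters stated in the table; model architecture, data loading, and the early-stopping criterion would need to come from the released repository.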