A Tensor Decomposition Perspective on Second-order RNNs

Authors: Maude Lizaire, Michael Rizvi-Martel, Marawan Gamal, Guillaume Rabusseau

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We support these results empirically with experiments on the Penn Treebank dataset which demonstrate that, with a fixed parameter budget, CPRNNs outperform RNNs, 2RNNs, and MIRNNs with the right choice of rank and hidden size.
Researcher Affiliation | Academia | Mila & DIRO, Université de Montréal, Montreal, Canada; CIFAR AI Chair. Correspondence to: Maude Lizaire <maude.lizaire@umontreal.ca>, Guillaume Rabusseau <grabus@iro.umontreal.ca>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code base for this paper can be found at https://github.com/MaudeLiz/cprnn
Open Datasets | Yes | We perform experiments on the Penn Treebank dataset (Marcus et al., 1993) measuring bits-per-character (BPC) using the same train/valid/test partition as in Mikolov et al. (2012).
Dataset Splits | Yes | We perform experiments on the Penn Treebank dataset (Marcus et al., 1993) measuring bits-per-character (BPC) using the same train/valid/test partition as in Mikolov et al. (2012).
Hardware Specification | No | The paper mentions 'material support from NVIDIA Corporation in the form of computational resources' but does not specify the exact hardware (e.g., specific GPU or CPU models, memory details) used for the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and tanh activation function, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | All models were trained using truncated backpropagation through time (Werbos, 1990) with a sequence length of 50, a batch size of 128, and the Adam optimizer (Kingma & Ba, 2015) to minimize the negative log-likelihood. Initial weights were drawn from a uniform random distribution U[-1/n, 1/n]. For all experiments, we use the tanh activation function. For training, we use early stopping and a scheduler that reduces the learning rate (initialized at 0.001) by half on plateaus of the validation loss.
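
To make the quoted setup concrete, the following is a minimal PyTorch sketch of that training configuration. It is not the authors' code: the CPRNN architecture is replaced by a plain nn.RNN placeholder, and the hidden size, vocabulary size, and helper names (CharRNN, tbptt_step) are assumptions for illustration. Only the reported hyperparameters (sequence length 50, batch size 128, Adam at 0.001, halving on validation-loss plateaus, tanh activation, uniform initialization, negative log-likelihood, bits-per-character) are taken from the table above.

```python
# Hedged sketch of the described training configuration (not the authors' implementation).
# A plain tanh nn.RNN stands in for the paper's CPRNN; sizes below are assumed for illustration.
import math
import torch
import torch.nn as nn

SEQ_LEN, BATCH_SIZE = 50, 128          # from the paper's reported setup
HIDDEN_SIZE, VOCAB_SIZE = 256, 50      # assumed values, not taken from the paper

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, nonlinearity="tanh", batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        # Initial weights drawn from a uniform distribution U[-1/n, 1/n] (as quoted above).
        bound = 1.0 / hidden_size
        for p in self.parameters():
            nn.init.uniform_(p, -bound, bound)

    def forward(self, x, h=None):
        out, h = self.rnn(self.embed(x), h)
        return self.out(out), h

model = CharRNN(VOCAB_SIZE, HIDDEN_SIZE)
criterion = nn.CrossEntropyLoss()                        # negative log-likelihood over characters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(  # halve the LR on validation-loss plateaus
    optimizer, factor=0.5)                               # call scheduler.step(val_loss) after validation

def tbptt_step(x, y, h):
    """One truncated-BPTT update on a (BATCH_SIZE, SEQ_LEN) chunk of character indices."""
    h = h.detach() if h is not None else None            # truncate gradients at chunk boundaries
    logits, h = model(x, h)
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), h

# Bits-per-character is the per-character cross-entropy expressed in base 2.
bpc = lambda nll: nll / math.log(2)
```

Early stopping is left out of the sketch; in practice one would track the validation loss each epoch, pass it to scheduler.step, and stop once it no longer improves.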