Deep Equilibrium Models

Authors: Shaojie Bai, J. Zico Kolter, Vladlen Koltun

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DEQ on both synthetic stress tests and realistic large-scale language modeling (where complex long-term temporal dependencies are involved). We use the two aforementioned instantiations of fθ in DEQ. On both WikiText-103 [35] (which contains >100M words and a vocabulary size of >260K) and the smaller Penn Treebank corpus (where stronger regularizations are needed for conventional deep nets) for word-level language modeling, we show that DEQ achieves competitive (or better) performance even when compared to SOTA methods (of the same model size, both weight-tied and not) while using significantly less memory.
Researcher Affiliation | Collaboration | Shaojie Bai (Carnegie Mellon University); J. Zico Kolter (Carnegie Mellon University; Bosch Center for AI); Vladlen Koltun (Intel Labs)
Pseudocode | No | No structured pseudocode or algorithm blocks (e.g., labeled 'Pseudocode' or 'Algorithm') were found in the paper.
Open Source Code | Yes | The code is available at https://github.com/locuslab/deq.
Open Datasets | Yes | On both WikiText-103 [35] (which contains >100M words and a vocabulary size of >260K) and the smaller Penn Treebank corpus... for word-level language modeling, we show that DEQ achieves competitive (or better) performance...
Dataset Splits | No | The paper uses standard datasets such as WikiText-103 and Penn Treebank and refers to 'Training Epoch' and 'Validation Perplexity' in Figure 3, implying the use of a validation set. However, it does not explicitly provide dataset split information (e.g., percentages or sample counts for training, validation, and test sets) within the text.
Hardware Specification | No | The paper mentions training on 'GPUs' and 'TPUs' ('Transformer-XL (X-large, adaptive embed., on TPU)' in Table 3), but it does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions 'PyTorch [45]' as an example of an autograd package used for its computations, but it does not provide specific version numbers for PyTorch or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | During training, we set this tolerance ε of the forward and backward passes to ε = T·10⁻⁸. At inference, we relax the tolerance to ε = T·10⁻² (or we can use a smaller maximum iteration limit for Broyden's method; see discussions later). For the DEQ-Transformers, we employ the relative positional embedding [16], with sequences of length 150 at both training and inference on the WikiText-103 dataset. [...] We initialize the parameters of fθ by sampling from N(0, 0.05).
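
To make the quoted Experiment Setup concrete, below is a minimal PyTorch sketch of a DEQ-style equilibrium solve under those settings. It is an illustration, not the authors' code: it substitutes plain fixed-point iteration for the paper's Broyden solver, and the toy f_theta, layer sizes, and helper names (ToyDEQLayer, solve_equilibrium) are assumptions made for brevity.

```python
# Minimal sketch of a DEQ-style equilibrium solve (assumption: plain fixed-point
# iteration stands in for Broyden's method; the toy f_theta is illustrative only).
import torch
import torch.nn as nn


class ToyDEQLayer(nn.Module):
    """A toy f_theta(z, x): one linear map of [z; x] followed by tanh."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(2 * dim, dim)
        # The paper initializes the parameters of f_theta from N(0, 0.05).
        nn.init.normal_(self.linear.weight, mean=0.0, std=0.05)
        nn.init.zeros_(self.linear.bias)

    def forward(self, z, x):
        return torch.tanh(self.linear(torch.cat([z, x], dim=-1)))


def solve_equilibrium(f, x, tol, max_iter=50):
    """Iterate z <- f(z, x) until the residual ||f(z, x) - z|| drops below tol.

    The paper uses Broyden's method with a tolerance (and optionally a smaller
    iteration limit at inference); fixed-point iteration keeps this sketch short.
    """
    z = torch.zeros_like(x)
    for _ in range(max_iter):
        z_next = f(z, x)
        if (z_next - z).norm() < tol:
            return z_next
        z = z_next
    return z


if __name__ == "__main__":
    dim, seq_len = 8, 150  # 150 matches the WikiText-103 sequence length above
    f = ToyDEQLayer(dim)
    x = torch.randn(seq_len, dim)

    # Tolerances loosely follow the quoted schedule: tight during training,
    # relaxed at inference (exact scaling with T as quoted in the table row).
    train_tol = seq_len * 1e-8
    infer_tol = seq_len * 1e-2

    z_star = solve_equilibrium(f, x, tol=infer_tol)
    print("equilibrium residual:", (f(z_star, x) - z_star).norm().item())
```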