Deep Equilibrium Models

Authors: Shaojie Bai, J. Zico Kolter, Vladlen Koltun

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DEQ on both synthetic stress tests and realistic large-scale language modeling (where complex long-term temporal dependencies are involved). We use the two aforementioned instantiations of fθ in DEQ. On both WikiText-103 [35] (which contains >100M words and a vocabulary size of >260K) and the smaller Penn Treebank corpus (where stronger regularizations are needed for conventional deep nets) for word-level language modeling, we show that DEQ achieves competitive (or better) performance even when compared to SOTA methods (of the same model size, both weight-tied and not) while using significantly less memory.
Researcher Affiliation | Collaboration | Shaojie Bai (Carnegie Mellon University); J. Zico Kolter (Carnegie Mellon University; Bosch Center for AI); Vladlen Koltun (Intel Labs)
Pseudocode | No | No structured pseudocode or algorithm blocks (e.g., labeled 'Pseudocode' or 'Algorithm') were found in the paper.
Open Source Code | Yes | The code is available at https://github.com/locuslab/deq.
Open Datasets | Yes | On both WikiText-103 [35] (which contains >100M words and a vocabulary size of >260K) and the smaller Penn Treebank corpus... for word-level language modeling, we show that DEQ achieves competitive (or better) performance...
Dataset Splits | No | The paper uses standard datasets such as WikiText-103 and Penn Treebank and refers to 'Training Epoch' and 'Validation Perplexity' in Figure 3, implying the use of a validation set. However, it does not explicitly provide dataset split information (e.g., percentages or sample counts for training, validation, and test sets) within the text.
Hardware Specification | No | The paper mentions training on 'GPUs' and 'TPUs' ('Transformer-XL (X-large, adaptive embed., on TPU)' in Table 3), but it does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions 'PyTorch [45]' as an example of an autograd package used for its computations, but it does not provide specific version numbers for PyTorch or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | During training, we set this tolerance ε of the forward and backward passes to ε = T·10⁻⁸. At inference, we relax the tolerance to ε = T·10⁻² (or we can use a smaller maximum iteration limit for Broyden's method; see discussions later). For the DEQ-Transformers, we employ the relative positional embedding [16], with sequences of length 150 at both training and inference on the WikiText-103 dataset. [...] We initialize the parameters of fθ by sampling from N(0, 0.05).
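
To make the quoted Experiment Setup concrete, below is a minimal PyTorch sketch of a DEQ-style equilibrium solve under those settings. It is an illustration, not the authors' code: it substitutes plain fixed-point iteration for the paper's Broyden solver, and the toy f_theta, layer sizes, and helper names (ToyDEQLayer, solve_equilibrium) are assumptions made for brevity.

```python
# Minimal sketch of a DEQ-style equilibrium solve (assumption: plain fixed-point
# iteration stands in for Broyden's method; the toy f_theta is illustrative only).
import torch
import torch.nn as nn


class ToyDEQLayer(nn.Module):
    """A toy f_theta(z, x): one linear map of [z; x] followed by tanh."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(2 * dim, dim)
        # The paper initializes the parameters of f_theta from N(0, 0.05).
        nn.init.normal_(self.linear.weight, mean=0.0, std=0.05)
        nn.init.zeros_(self.linear.bias)

    def forward(self, z, x):
        return torch.tanh(self.linear(torch.cat([z, x], dim=-1)))


def solve_equilibrium(f, x, tol, max_iter=50):
    """Iterate z <- f(z, x) until the residual ||f(z, x) - z|| drops below tol.

    The paper uses Broyden's method with a tolerance (and optionally a smaller
    iteration limit at inference); fixed-point iteration keeps this sketch short.
    """
    z = torch.zeros_like(x)
    for _ in range(max_iter):
        z_next = f(z, x)
        if (z_next - z).norm() < tol:
            return z_next
        z = z_next
    return z


if __name__ == "__main__":
    dim, seq_len = 8, 150  # 150 matches the WikiText-103 sequence length above
    f = ToyDEQLayer(dim)
    x = torch.randn(seq_len, dim)

    # Tolerances loosely follow the quoted schedule: tight during training,
    # relaxed at inference (exact scaling with T as quoted in the table row).
    train_tol = seq_len * 1e-8
    infer_tol = seq_len * 1e-2

    z_star = solve_equilibrium(f, x, tol=infer_tol)
    print("equilibrium residual:", (f(z_star, x) - z_star).norm().item())
```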