Neural Deep Equilibrium Solvers

Authors: Shaojie Bai, Vladlen Koltun, J Zico Kolter

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that these neural equilibrium solvers are fast to train (taking only an extra 0.9-1.1% over the original DEQ's training time), require few additional parameters (1-3% of the original model size), yet lead to a 2× speedup in DEQ network inference without any degradation in accuracy across numerous domains and tasks.
Researcher Affiliation | Collaboration | Shaojie Bai (Carnegie Mellon University); Vladlen Koltun (Apple); J. Zico Kolter (Carnegie Mellon University and Bosch Center for AI)
Pseudocode | Yes | Algorithm 1: Anderson acceleration (AA) prototype (with parameters β and m). A hedged sketch of this scheme follows the table.
Open Source Code | Yes | Code is available at https://github.com/locuslab/deq.
Open Datasets | Yes | To evaluate the neural deep equilibrium solvers, we apply them to three of the largest-scale and highest-dimensional tasks that implicit models have ever been applied to, across the vision and language modalities. ... WikiText-103 language modeling (Merity et al., 2017), ImageNet classification (Deng et al., 2009), and Cityscapes semantic segmentation with megapixel images (Cordts et al., 2016).
Dataset Splits | Yes | The WikiText-103 corpus contains over 103M words in its training split, and 218K/246K words for validation/test.
Hardware Specification | Yes | All of our experiments were conducted on NVIDIA RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions using the “Adam optimizer” and building upon the “DEQ repo” and “MDEQ repo”, but does not specify software versions for programming languages, libraries, or frameworks such as PyTorch or CUDA.
Experiment Setup | Yes | Note that our approach only introduces minimal new hyperparameters (as the original DEQ model parameters are frozen). For the language modeling task, we use the Adam optimizer (Kingma & Ba, 2015) with a starting learning rate of 0.001 and cosine learning rate annealing (Loshchilov & Hutter, 2017). The neural solver is trained for 5000 steps, with sequences of length 60 and batch size 10, on top of a pretrained DEQ with word embedding dimension 700. A hedged sketch of this configuration follows the table.
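
The pseudocode row above quotes Algorithm 1, Anderson acceleration with damping β and memory m, the classical fixed-point scheme around which the paper's learned solver is built. The following is a minimal NumPy sketch of generic Anderson acceleration, not the paper's PyTorch implementation from the DEQ repository; the function name anderson, the regularized least-squares formulation, and the default values are assumptions made for illustration.

    import numpy as np

    def anderson(f, x0, m=5, beta=1.0, max_iter=50, tol=1e-6):
        # Solve the fixed-point equation x* = f(x*): keep the last m iterates,
        # compute mixing weights alpha via a small least-squares problem on the
        # residuals, and combine past iterates/evaluations with damping beta.
        X, F = [x0], [f(x0)]
        x = F[0]
        for _ in range(max_iter):
            X.append(x)
            F.append(f(x))
            n = min(m, len(X))
            # Residuals g_i = f(x_i) - x_i for the last n iterates (flattened).
            G = np.stack([(F[-n + i] - X[-n + i]).ravel() for i in range(n)])
            # Minimize ||G^T alpha|| subject to sum(alpha) = 1, via regularized
            # normal equations followed by renormalization of alpha.
            H = G @ G.T + 1e-8 * np.eye(n)
            alpha = np.linalg.solve(H, np.ones(n))
            alpha /= alpha.sum()
            x_new = (1 - beta) * sum(a * xi for a, xi in zip(alpha, X[-n:])) \
                  + beta * sum(a * fi for a, fi in zip(alpha, F[-n:]))
            if np.linalg.norm(x_new - x) < tol * (1 + np.linalg.norm(x)):
                return x_new
            x = x_new
        return x

    # Example usage: x = cos(x) has a fixed point near 0.739.
    x_star = anderson(np.cos, np.array([1.0]))

The design point worth noting is the memory m: the update extrapolates from a short history of iterates rather than the latest one alone, and the paper's contribution is to learn parts of such a solver for a specific pretrained DEQ instead of hand-tuning them.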
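
The experiment setup row specifies only optimization hyperparameters (Adam, learning rate 0.001, cosine annealing, 5000 steps, sequence length 60, batch size 10, frozen DEQ with word embedding dimension 700). The PyTorch sketch below merely wires those numbers together; the two Linear modules and the imitation-style loss are hypothetical stand-ins, not the paper's actual solver architecture or training objective.

    import torch

    # Hypothetical stand-ins; only the hyperparameters come from the quoted setup.
    pretrained_deq = torch.nn.Linear(700, 700)   # frozen pretrained DEQ (embedding dim 700)
    neural_solver = torch.nn.Linear(700, 700)    # lightweight solver network being trained

    for p in pretrained_deq.parameters():        # the original DEQ parameters are frozen
        p.requires_grad_(False)

    optimizer = torch.optim.Adam(neural_solver.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5000)

    for step in range(5000):                     # 5000 training steps
        batch = torch.randn(10, 60, 700)         # batch size 10, sequence length 60 (dummy data)
        with torch.no_grad():
            target = pretrained_deq(batch)       # placeholder target from the frozen DEQ
        loss = (neural_solver(batch) - target).pow(2).mean()   # placeholder objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

Because only the solver's few parameters receive gradients, a loop of this shape is consistent with the quoted claim that solver training adds little cost on top of the pretrained DEQ.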