Neural Deep Equilibrium Solvers
Authors: Shaojie Bai, Vladlen Koltun, J. Zico Kolter
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that these neural equilibrium solvers are fast to train (only taking an extra 0.9-1.1% over the original DEQ's training time), require few additional parameters (1-3% of the original model size), yet lead to a 2× speedup in DEQ network inference without any degradation in accuracy across numerous domains and tasks. |
| Researcher Affiliation | Collaboration | Shaojie Bai (Carnegie Mellon University); Vladlen Koltun (Apple); J. Zico Kolter (Carnegie Mellon University and Bosch Center for AI) |
| Pseudocode | Yes | Algorithm 1: Anderson acceleration (AA) prototype (with parameters β and m). A minimal AA sketch appears after the table. |
| Open Source Code | Yes | Code is available at https://github.com/locuslab/deq. |
| Open Datasets | Yes | To evaluate the neural deep equilibrium solvers, we apply them on three of the largest-scale and highest-dimensional tasks that implicit models have ever been applied to, across the vision and language modalities. ... WikiText-103 language modeling (Merity et al., 2017), ImageNet classification (Deng et al., 2009), and Cityscapes semantic segmentation with megapixel images (Cordts et al., 2016). |
| Dataset Splits | Yes | The WikiText-103 corpus contains over 103M words in its training split, and 218K/246K words for validation/test. |
| Hardware Specification | Yes | All of our experiments were conducted on NVIDIA RTX 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions using “Adam optimizer” and building upon “DEQ repo” and “MDEQ repo” but does not specify software versions for programming languages, libraries, or frameworks like PyTorch or CUDA. |
| Experiment Setup | Yes | Note that our approach only introduces minimal new hyperparameters (as the original DEQ model parameters are frozen). For the language modeling task, we use the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 0.001 and cosine learning rate annealing (Loshchilov & Hutter, 2017). The neural solver is trained for 5000 steps, with sequences of length 60 and batch size 10, on top of a pretrained DEQ with word embedding dimension 700. A hypothetical training-loop sketch appears after the table. |
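The Pseudocode row cites Algorithm 1, an Anderson acceleration (AA) prototype with parameters β and m. The sketch below shows classical batched Anderson acceleration in the style of the publicly available DEQ tutorial code; the function name `anderson`, the arguments `f`, `x0`, `m`, `beta`, `lam`, `max_iter`, and `tol`, and their default values are illustrative and are not taken from the paper's exact Algorithm 1.

```python
import torch

def anderson(f, x0, m=5, beta=1.0, lam=1e-4, max_iter=50, tol=1e-3):
    """Anderson acceleration (AA) for the fixed point z* = f(z*).

    x0 is a batched tensor; iterates are stored flattened per batch element.
    m is the history size, beta the damping/mixing parameter.
    """
    bsz = x0.shape[0]
    dim = x0.view(bsz, -1).shape[1]
    X = torch.zeros(bsz, m, dim, dtype=x0.dtype, device=x0.device)  # past iterates
    F = torch.zeros(bsz, m, dim, dtype=x0.dtype, device=x0.device)  # past f(iterates)
    X[:, 0], F[:, 0] = x0.view(bsz, -1), f(x0).view(bsz, -1)
    X[:, 1], F[:, 1] = F[:, 0], f(F[:, 0].view_as(x0)).view(bsz, -1)

    # Bordered KKT system for the least-squares mixing weights alpha
    # (minimize the combined residual subject to the weights summing to 1).
    H = torch.zeros(bsz, m + 1, m + 1, dtype=x0.dtype, device=x0.device)
    H[:, 0, 1:] = H[:, 1:, 0] = 1
    y = torch.zeros(bsz, m + 1, 1, dtype=x0.dtype, device=x0.device)
    y[:, 0] = 1

    for k in range(2, max_iter):
        n = min(k, m)
        G = F[:, :n] - X[:, :n]  # residuals of the stored iterates
        H[:, 1:n + 1, 1:n + 1] = torch.bmm(G, G.transpose(1, 2)) + \
            lam * torch.eye(n, dtype=x0.dtype, device=x0.device)[None]
        alpha = torch.linalg.solve(H[:, :n + 1, :n + 1], y[:, :n + 1])[:, 1:n + 1, 0]

        # Damped AA update: beta mixes the extrapolated f-values and iterates.
        X[:, k % m] = beta * (alpha[:, None] @ F[:, :n])[:, 0] + \
                      (1 - beta) * (alpha[:, None] @ X[:, :n])[:, 0]
        F[:, k % m] = f(X[:, k % m].view_as(x0)).view(bsz, -1)
        rel_res = (F[:, k % m] - X[:, k % m]).norm() / (1e-5 + F[:, k % m].norm())
        if rel_res < tol:
            break
    return X[:, k % m].view_as(x0)
```

In a DEQ, `f` is the (frozen) equilibrium layer applied for a fixed input. The paper's learned neural solver augments this kind of classical iteration; only the classical baseline is shown here.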
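The Experiment Setup row reports the language-modeling hyperparameters (Adam, initial learning rate 0.001, cosine annealing, 5000 steps, sequence length 60, batch size 10, frozen pretrained DEQ with word embedding dimension 700). The sketch below shows one way such a training loop could be wired up; `train_solver`, `neural_solver`, `pretrained_deq`, and `solver_loss` are hypothetical placeholders, not identifiers from the authors' repository, and the loss is only a plausible stand-in.

```python
import torch
import torch.nn.functional as F

def solver_loss(neural_solver, pretrained_deq, batch):
    """Placeholder objective: NOT the paper's loss.

    One plausible choice is to penalize the distance between the solver's
    prediction and a reference fixed point from the frozen DEQ; the actual
    objective should be taken from the authors' code.
    """
    z_pred = neural_solver(batch)
    with torch.no_grad():
        z_star = pretrained_deq(batch)  # reference output of the frozen model
    return F.mse_loss(z_pred, z_star)

def train_solver(neural_solver, pretrained_deq, batches, steps=5000, lr=1e-3):
    """Hypothetical training loop matching the reported hyperparameters."""
    # The original DEQ parameters stay frozen; only the neural solver is trained.
    for p in pretrained_deq.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.Adam(neural_solver.parameters(), lr=lr)
    # Cosine learning-rate annealing over the full 5000-step run.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)

    for step, batch in zip(range(steps), batches):
        # Each batch: token sequences of length 60, batch size 10 (per the table).
        loss = solver_loss(neural_solver, pretrained_deq, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```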