Deep equilibrium networks are sensitive to initialization statistics
Authors: Atish Agarwala, Samuel S Schoenholz
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We begin by training a fully-connected DEQ on MNIST (Le Cun et al., 1989). ... We next examine the effects of the matrix ensembles on a DEQ using a vanilla transformer layer from (Al-Rfou et al., 2019) as the base of the DEQ layer, trained on Wikitext-103 (Merity et al., 2016). |
| Researcher Affiliation | Industry | 1Google Research, Brain Team. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to basing its experiments on 'a Haiku implementation of a DEQ transformer (Khan, 2020)', but does not explicitly state that the code specific to the methodology described in this paper is open-source or provide a link to it. |
| Open Datasets | Yes | We begin by training a fully-connected DEQ on MNIST (Le Cun et al., 1989). ... trained on Wikitext-103 (Merity et al., 2016). |
| Dataset Splits | No | The paper mentions training on datasets and evaluating test error/loss, but does not explicitly provide details about train/validation/test dataset splits or cross-validation setup. |
| Hardware Specification | Yes | We trained on TPUv3. |
| Software Dependencies | No | The paper mentions a 'Haiku implementation' and a 'sentencepiece tokenizer' but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Learning rate tuning of an ADAM optimizer with momentum of 0.9 suggested an optimal learning rate of 10^−2 for all the conditions studied. We trained on Wikitext-103 (Merity et al., 2016) with a batch size of 512 and a context length of 128. We ran for 20 steps of the Broyden solver. For the experiments with multiple seeds, we used a learning rate of 10^−3 with a linear warmup for 2×10^3 steps, followed by a cosine learning rate decay for 5×10^4 steps. |
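
For context on the DEQ setup quoted in the table above (a fixed-point layer solved with 20 Broyden steps), here is a minimal sketch of a DEQ forward pass in JAX. It substitutes a plain damped fixed-point iteration for the paper's Broyden solver, and the layer map, width, and initialization scale are illustrative placeholders, not the authors' architecture.

```python
import jax
import jax.numpy as jnp


def deq_fixed_point(f, params, x, num_steps=20, damping=0.5):
    """Approximate the DEQ output z* = f(params, z*, x).

    Uses a damped fixed-point iteration z <- (1 - damping) * z + damping * f(params, z, x)
    as a stand-in for the 20-step Broyden solver quoted in the table.
    """
    z = jnp.zeros_like(x)
    for _ in range(num_steps):
        z = (1.0 - damping) * z + damping * f(params, z, x)
    return z


# Illustrative single-layer map: z_{t+1} = tanh(z W + x U + b).
def layer(params, z, x):
    W, U, b = params
    return jnp.tanh(z @ W + x @ U + b)


key = jax.random.PRNGKey(0)
k_w, k_u = jax.random.split(key)
d = 64
params = (
    0.1 * jax.random.normal(k_w, (d, d)),  # recurrent weights (scale is arbitrary here)
    0.1 * jax.random.normal(k_u, (d, d)),  # input weights
    jnp.zeros((d,)),                       # bias
)
x = jax.random.normal(key, (8, d))         # batch of 8 inputs
z_star = deq_fixed_point(layer, params, x)
```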
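
The learning-rate schedule quoted in the 'Experiment Setup' row (linear warmup for 2×10^3 steps followed by cosine decay over 5×10^4 steps, with Adam and first-moment decay 0.9) can be expressed with optax, the optimizer library commonly paired with Haiku. The exact argument values below are an assumption based on that quote, not the authors' released configuration.

```python
import optax

# Warmup + cosine decay matching the quoted multi-seed runs:
# peak learning rate 1e-3, 2,000 linear warmup steps, 50,000 decay steps.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-3,
    warmup_steps=2_000,
    decay_steps=50_000,
)

# Adam with first-moment decay ("momentum") of 0.9, as quoted in the table.
optimizer = optax.adam(learning_rate=schedule, b1=0.9)

# Typical usage with a Haiku/JAX parameter pytree `params` and gradients `grads`:
# opt_state = optimizer.init(params)
# updates, opt_state = optimizer.update(grads, opt_state, params)
# params = optax.apply_updates(params, updates)
```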