Path Independent Equilibrium Models Can Better Exploit Test-Time Computation
Authors: Cem Anil, Ashwini Pokle, Kaiqu Liang, Johannes Treutlein, Yuhuai Wu, Shaojie Bai, J. Zico Kolter, Roger B. Grosse
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that a broad class of architectures named equilibrium models display strong upwards generalization, and find that stronger performance on harder examples (which require more iterations of inference to get correct) strongly correlates with the path independence of the system: its tendency to converge to the same steady-state behaviour regardless of initialization, given enough computation. Experimental interventions made to promote path independence result in improved generalization on harder problem instances, while those that penalize it degrade this ability. Our results help explain why equilibrium models are capable of strong upwards generalization and motivate future work that harnesses path independence as a general modelling principle to facilitate scalable test-time usage. |
| Researcher Affiliation | Collaboration | (1) University of Toronto and Vector Institute; (2) Carnegie Mellon University; (3) Princeton University; (4) University of California, Berkeley; (5) Stanford University and Google Research; (6) Bosch Center for AI |
| Pseudocode | Yes | Algorithm 1: Asymptotic Alignment Score (an illustrative sketch of the idea follows the table) |
| Open Source Code | No | Code will be released along with the paper (Footnote 2). This is a promise for future release, not a concrete statement of availability or a link. |
| Open Datasets | Yes | We focus on multiple algorithmic generalization tasks: prefix sum and mazes by Schwarzschild et al. [2021a,b], blurry MNIST, matrix inversion and edge copy by Du et al. [2022]. |
| Dataset Splits | No | We evaluated performance on a mixture of in- and OOD validation data; results on individual data splits can be found in the supplementary material. The main paper does not specify the exact percentages or sample counts for the training, validation, and test splits. |
| Hardware Specification | No | The paper does not specify any hardware details such as CPU or GPU models, or other computational resources used for running the experiments. |
| Software Dependencies | No | The paper mentions using "L-BFGS" optimizer, but it does not specify any software dependencies with version numbers, such as programming languages or libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | On prefix sum experiments, we varied 1) network depth, 2) whether weight norm (wnorm) [Salimans and Kingma, 2016] was used, 3) learning rate (one of [0.01, 0.001, 0.0001]), 4) forward solver (fixed point iterations or Anderson acceleration [Anderson, 1965]), and 5) the gradient estimator (backprop or implicit gradients). On the maze experiments, we varied 1) network depth, 2) use of weight norm, 3) forward solver (fixed point iterations or Broyden solver [Broyden, 1965]), and 4) the gradient estimator (backprop or implicit gradients). An illustrative sketch of the two forward solvers follows the table. |
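The Pseudocode row above only names Algorithm 1 (Asymptotic Alignment Score); the paper's exact procedure is not reproduced in this report. Below is a minimal sketch of the underlying idea as described in the abstract, assuming a PyTorch-style model whose forward pass is a single fixed-point update `f(z, x)`: iterate the update from two different initializations and compare the resulting states. All names here (`fixed_point_iterate`, `aa_score`, `num_iters`) are illustrative, not the paper's.

```python
import torch

def fixed_point_iterate(f, x, z0, num_iters=100):
    """Run plain fixed-point iterations z <- f(z, x) from the given init."""
    z = z0
    for _ in range(num_iters):
        z = f(z, x)
    return z

def aa_score(f, x, num_iters=100):
    """Cosine similarity between the states reached from a zero init and a
    random init; a value near 1.0 suggests path independence on input x."""
    z_zero = fixed_point_iterate(f, x, torch.zeros_like(x), num_iters)
    z_rand = fixed_point_iterate(f, x, torch.randn_like(x), num_iters)
    return torch.nn.functional.cosine_similarity(
        z_zero.flatten(), z_rand.flatten(), dim=0
    )

# Toy contractive update: it converges to a unique fixed point from any
# initialization, so the score should come out close to 1.
W = 0.5 * torch.eye(4)
f = lambda z, x: torch.tanh(z @ W + x)
x = torch.randn(4)
print(aa_score(f, x).item())
```

A score near 1.0 indicates the two trajectories converged to the same state, i.e., path-independent behaviour on that input; a path-dependent model would yield noticeably lower scores.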
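The Experiment Setup row varies the forward solver between plain fixed-point iterations and Anderson acceleration [Anderson, 1965]. The sketch below contrasts the two, assuming PyTorch and an update function `f(z)` with the input already bound. It is an illustrative solver in the style of common deep equilibrium model implementations, not the authors' code; the history size `m=5` and damping `lam` are assumed defaults.

```python
import torch

def fixed_point_solver(f, z0, max_iter=100, tol=1e-6):
    """Naive forward solver: apply z <- f(z) until the update is small."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z)
        if torch.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

def anderson_solver(f, z0, m=5, lam=1e-4, max_iter=50, tol=1e-6):
    """Anderson acceleration: extrapolate from the last m iterates by
    minimizing a regularized combination of their residuals."""
    d = z0.numel()
    X = torch.zeros(max_iter, d)  # past iterates, flattened
    F = torch.zeros(max_iter, d)  # f evaluated at past iterates
    X[0], F[0] = z0.flatten(), f(z0).flatten()
    X[1], F[1] = F[0], f(F[0].view_as(z0)).flatten()
    k = 1  # in case the loop below does not run
    for k in range(2, max_iter):
        n = min(k, m)
        G = F[k - n:k] - X[k - n:k]          # residuals g_i = f(z_i) - z_i
        H = G @ G.T + lam * torch.eye(n)     # regularized Gram matrix
        alpha = torch.linalg.solve(H, torch.ones(n))
        alpha = alpha / alpha.sum()          # mixing weights summing to one
        X[k] = alpha @ F[k - n:k]            # extrapolated next iterate
        F[k] = f(X[k].view_as(z0)).flatten()
        if torch.norm(F[k] - X[k]) < tol:
            break
    return F[k].view_as(z0)

# Toy contraction: both solvers should find the fixed point of cos(z) ~ 0.739.
f = lambda z: torch.cos(z)
z0 = torch.zeros(1)
print(fixed_point_solver(f, z0).item(), anderson_solver(f, z0).item())
```

Anderson acceleration solves a small regularized least-squares problem over the last m residuals to extrapolate the next iterate, typically reaching the fixed point in far fewer function evaluations than naive iteration.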