Stabilizing Equilibrium Models by Jacobian Regularization

Authors: Shaojie Bai, Vladlen Koltun, Zico Kolter

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the proposed regularization by experiments on both toy-scale synthetic tasks and large-scale real datasets across domains: word-level language modeling on WikiText-103 (Merity et al., 2017) and high-resolution image classification on the full ImageNet dataset (Deng et al., 2009).
Researcher Affiliation | Collaboration | 1Carnegie Mellon University, Pittsburgh, PA, USA; 2Intel Labs, USA. Correspondence to: Shaojie Bai <shaojieb@cs.cmu.edu>.
Pseudocode | No | The paper describes the model architecture and processes in prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available here.
Open Datasets | Yes | We validate the proposed regularization by experiments on both toy-scale synthetic tasks and large-scale real datasets across domains: word-level language modeling on WikiText-103 (Merity et al., 2017) and high-resolution image classification on the full ImageNet dataset (Deng et al., 2009).
Dataset Splits | Yes | We generated 5096 scalar data pairs (x, y) using the function y = h(x) = (3/2)x^3 + x^2 - 5x + 2 sin(x) - 3 + δ (where δ ~ N(0, 0.05)), and split them into 4096 training and 1000 validation samples, respectively.
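The synthetic split described above can be sketched as follows. The coefficients of h(x) are a best-effort reading of the paper's formula, and the input sampling range is an assumption not stated in the quoted text:

```python
import numpy as np

def h(x, rng):
    # Synthetic target: (3/2)x^3 + x^2 - 5x + 2*sin(x) - 3,
    # plus Gaussian noise delta ~ N(0, 0.05).
    delta = rng.normal(0.0, 0.05, size=x.shape)
    return 1.5 * x**3 + x**2 - 5.0 * x + 2.0 * np.sin(x) - 3.0 + delta

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=5096)  # input range is an assumption
y = h(x, rng)

# 4096 training / 1000 validation samples, as reported.
x_train, y_train = x[:4096], y[:4096]
x_val, y_val = x[4096:], y[4096:]
```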
Hardware Specification | No | The paper states, 'The memory and speeds reported are benchmarked across different models on the same setting (e.g., same batch size, sequence length, number of steps, etc.) with the same GPU,' but does not specify the model or detailed specifications of the GPU or any other hardware.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | As we found the Jacobian regularization could sometimes hurt performance (see Sec. 5.3), we only apply the proposed loss stochastically with a probability p, and gradually increase this p or the regularization strength γ (see Eq. (4)) over training steps. We also use a cosine learning rate schedule (Loshchilov & Hutter, 2017) for all tasks, including the synthetic one.
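The setup above — a cosine learning-rate schedule plus a Jacobian loss applied only with probability p, with p and γ ramped over training — can be sketched as below. The linear ramp and the final values of p and γ are illustrative assumptions; the paper only states that they increase over training steps:

```python
import math
import random

def cosine_lr(step, total_steps, lr_max):
    # Cosine learning-rate schedule (Loshchilov & Hutter, 2017):
    # decays from lr_max at step 0 to ~0 at total_steps.
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * step / total_steps))

def jac_reg_weight(step, total_steps, p_final=1.0, gamma_final=1.0, rng=random):
    # Apply the Jacobian loss only with probability p; both p and the
    # strength gamma are ramped up linearly (an assumed schedule).
    frac = step / total_steps
    p = p_final * frac
    gamma = gamma_final * frac
    return gamma if rng.random() < p else 0.0

# Per step, the total loss would then be:
#   loss = task_loss + jac_reg_weight(step, total_steps) * jacobian_loss
```

Returning a weight of 0 when the coin flip fails lets the training loop keep a single loss expression instead of branching on whether the regularizer is active.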