ISAAC Newton: Input-based Approximate Curvature for Newton's Method

Authors: Felix Petersen, Tobias Sutter, Christian Borgelt, Dongsung Huh, Hilde Kuehne, Yuekai Sun, Oliver Deussen

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTS: In the previous section, we discussed the theoretical properties of the proposed update directions, in particular that ζ would actually be free to compute in the mini-batch regime. In this section, we provide empirical evidence that ζ is a good update direction, even in deep learning. Specifically, we demonstrate that (E1) ζ achieves similar performance to K-FAC, while being substantially cheaper to compute. (E2) The performance of our proposed method can be empirically maintained in the mini-batch regime (n ≫ b). (E3) ζ may be used for individual layers, while for other layers only the gradient is used. This still leads to improved performance. (E4) ζ also improves the performance for training larger models such as BERT and ResNet. (E5) The runtime and memory requirements of ζ are comparable to those of gradient descent.
Researcher Affiliation | Collaboration | Felix Petersen (1,2), Tobias Sutter (2), Christian Borgelt (3), Dongsung Huh (4), Hilde Kuehne (4,5), Yuekai Sun (6), Oliver Deussen (2); 1 Stanford University, 2 University of Konstanz, 3 University of Salzburg, 4 MIT-IBM Watson AI Lab, 5 University of Frankfurt, 6 University of Michigan
Pseudocode | Yes | A PYTORCH IMPLEMENTATION: We display a PyTorch [26] implementation of ISAAC for a fully-connected layer below. Here, we mark the important part (i.e., the part beyond the boilerplate) with a red rectangle.

import torch

class ISAACLinearFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, bias, la, inv_type):
        ctx.save_for_backward(input, weight, bias)
        ctx.la = la
        if inv_type == 'cholesky_inverse':
            ctx.inverse = torch.cholesky_inverse
        elif inv_type == 'inverse':
            ctx.inverse = torch.inverse
        else:
            raise NotImplementedError(inv_type)
        # The forward pass is an ordinary affine map.
        return input @ weight.T + (bias if bias is not None else 0)

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_0 = grad_output @ weight if ctx.needs_input_grad[0] else None
        if ctx.needs_input_grad[1]:
            # b x b input Gram matrix, normalized by the batch size.
            aaT = input @ input.T / grad_output.shape[0]
            I_b = torch.eye(aaT.shape[0], device=aaT.device, dtype=aaT.dtype)
            aaT_IaaT_inv = aaT @ ctx.inverse(aaT / ctx.la + I_b)
            # ISAAC-conditioned weight gradient; only a b x b matrix is inverted.
            grad_1 = grad_output.T @ (I_b - 1. / ctx.la * aaT_IaaT_inv) @ input
        else:
            grad_1 = None
        # One gradient per forward argument: input, weight, bias, la, inv_type.
        return (
            grad_0,
            grad_1,
            grad_output.mean(0, keepdim=True) if bias is not None else None,
            None,
            None,
        )

class ISAACLinear(torch.nn.Linear):
    def __init__(self, in_features, out_features, la, inv_type='inverse', **kwargs):
        super(ISAACLinear, self).__init__(
            in_features=in_features, out_features=out_features, **kwargs
        )
        self.la = la
        self.inv_type = inv_type

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return ISAACLinearFunction.apply(
            input,
            self.weight,
            self.bias.unsqueeze(0) if self.bias is not None else None,
            self.la,
            self.inv_type,
        )
Open Source Code | No | Code will be made available at github.com/Felix-Petersen/isaac
Open Datasets | Yes | Here, we train a 5-layer deep neural network on the MNIST classification task [16] with a batch size of 60 for a total of 40 epochs or 40 000 steps.
Dataset Splits | No | The paper states that it uses the MNIST classification task, which has standard splits, but it does not explicitly provide percentages or counts for the training, validation, and test splits; it only mentions the batch size and number of epochs used for training.
Hardware Specification | Yes | Table 2: Runtimes and memory requirements for different models. Runtime is the training time per epoch on MNIST at a batch size of 60, i.e., for 1 000 training steps. The K-FAC implementation is from the backpack library [15]. The GPU is an Nvidia A6000.
Software Dependencies | No | A PYTORCH IMPLEMENTATION: We display a PyTorch [26] implementation of ISAAC for a fully-connected layer below.
Experiment Setup | Yes | Here, we train a 5-layer deep neural network on the MNIST classification task [16] with a batch size of 60 for a total of 40 epochs or 40 000 steps.
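
For reference, the per-layer weight update computed by the appendix code quoted in the Pseudocode row can be written compactly. The notation below (A, G, b, λ) is assumed here rather than taken from the excerpts: A is the b × d_in matrix of mini-batch layer inputs, G is the b × d_out matrix of output gradients, b is the batch size, and λ corresponds to la in the code (the code's aaT equals A Aᵀ / b).

\Delta W
  = G^\top \left( I_b - \frac{1}{\lambda} \cdot \frac{1}{b} A A^\top \left( \frac{1}{b\lambda} A A^\top + I_b \right)^{-1} \right) A
  = G^\top \left( \frac{1}{b\lambda} A A^\top + I_b \right)^{-1} A

The second equality is a standard push-through identity; only the b × b matrix (1/(bλ)) A Aᵀ + I_b is inverted, which matches the excerpt's claim that the overhead of ζ is negligible in the mini-batch regime.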
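
Below is a minimal usage sketch for the ISAACLinear layer quoted in the Pseudocode row; it is not from the paper. It assumes the ISAACLinear and ISAACLinearFunction definitions above are in scope, and la=1.0 and the layer sizes are illustrative values (the batch size of 60 matches the one quoted under Experiment Setup).

import torch

# Assumes ISAACLinear from the quoted appendix code is available in scope.
layer = ISAACLinear(in_features=784, out_features=400, la=1.0)

x = torch.randn(60, 784, requires_grad=True)  # one mini-batch, b = 60 < layer width
out = layer(x)                                # forward: ordinary affine map
out.sum().backward()                          # backward: ISAAC-conditioned weight gradient

print(layer.weight.grad.shape)  # torch.Size([400, 784])
print(layer.bias.grad.shape)    # torch.Size([400])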
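
Finally, a sketch of the quoted experiment setup: a 5-layer network on MNIST with a batch size of 60 for 40 epochs, i.e., 40 000 steps at 1 000 steps per epoch. The hidden width, la value, learning rate, and the plain SGD optimizer are illustrative assumptions; the excerpts do not specify them.

import torch
from torchvision import datasets, transforms

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=60, shuffle=True)

def isaac_mlp(width=400, la=1.0):
    # 5 ISAACLinear layers (quoted in the Pseudocode row) with ReLU in between.
    layers, d_in = [torch.nn.Flatten()], 28 * 28
    for _ in range(4):
        layers += [ISAACLinear(d_in, width, la=la), torch.nn.ReLU()]
        d_in = width
    layers += [ISAACLinear(d_in, 10, la=la)]
    return torch.nn.Sequential(*layers)

model = isaac_mlp()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # optimizer and lr are assumptions

for epoch in range(40):          # 40 epochs x 1 000 steps/epoch = 40 000 steps
    for x, y in train_loader:    # x: (60, 1, 28, 28), y: (60,)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()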