TrAct: Making First-layer Pre-Activations Trainable

Authors: Felix Petersen, Christian Borgelt, Stefano Ermon

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we find that TrAct (Training Activations) speeds up training by factors between 1.25 and 4 while requiring only a small computational overhead. We demonstrate the utility of TrAct with different optimizers for a range of different vision models including convolutional and transformer architectures. In a wide range of experiments, we demonstrate the utility of the proposed approach, effectively speeding up training by factors ranging from 1.25 to 4.
Researcher Affiliation | Academia | Felix Petersen (Stanford University, mail@felix-petersen.de), Christian Borgelt (University of Salzburg, christian@borgelt.net), Stefano Ermon (Stanford University, ermon@cs.stanford.edu)
Pseudocode | Yes |

    def backward(grad_z, x, W, l=0.1):
        b, n = x.shape
        grad_W = grad_z.T @ x @ inverse(x.T @ x / b + l * eye(n))
        return grad_W

    Figure 2: Implementation of TrAct, where l corresponds to the hyperparameter λ.
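For reference, the quoted pseudocode can be turned into a runnable custom autograd function. The sketch below is our own illustration, assuming the first layer is a bias-free linear map over flattened inputs (for a convolutional first layer, x would be the unfolded patches); the class name TrActLinearFn and the default lam=0.1 are illustrative choices, not the authors' released code (see the linked repository for that).

    import torch

    class TrActLinearFn(torch.autograd.Function):
        # Bias-free linear first layer whose weight gradient follows the rule
        # quoted above: grad_W = grad_z^T x (x^T x / b + lam I)^{-1}.
        @staticmethod
        def forward(ctx, x, W, lam):
            ctx.save_for_backward(x, W)
            ctx.lam = lam
            return x @ W.T  # pre-activations z

        @staticmethod
        def backward(ctx, grad_z):
            x, W = ctx.saved_tensors
            b, n = x.shape
            grad_x = grad_z @ W  # input gradient is left unchanged
            gram = x.T @ x / b + ctx.lam * torch.eye(n, device=x.device, dtype=x.dtype)
            grad_W = grad_z.T @ x @ torch.linalg.inv(gram)
            return grad_x, grad_W, None

    # Usage sketch: a batch of 128 flattened 32x32 RGB images, 64 output channels.
    x = torch.randn(128, 3 * 32 * 32, requires_grad=True)
    W = torch.randn(64, 3 * 32 * 32, requires_grad=True)
    TrActLinearFn.apply(x, W, 0.1).sum().backward()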
Open Source Code | Yes | The code is publicly available at github.com/Felix-Petersen/tract.
Open Datasets | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. For CIFAR-100, we consider two experimental settings. Finally, we consider training on the ImageNet data set [40]. We fine-tune the ViT-S (800 epoch pre-training) model on the data sets CIFAR-10 and CIFAR-100 [22], Flowers-102 [41], and Stanford Cars [42].
Dataset Splits | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. We perform training for 100, 200, 400, and 800 epochs. For the ResNet models, we use the Adam and SGD with momentum (0.9) optimizers, both with cosine learning rate schedules; learning rates, due to their significance, will be discussed alongside respective experiments. Further, we use the standard softmax cross-entropy loss. For the ViT, we use Adam with a cosine learning rate scheduler as well as a softmax cross-entropy loss with label smoothing (0.1). The selected ViT is particularly designed for effective training on CIFAR scales and has 7 layers, 12 heads, and hidden sizes of 384. Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19].
Hardware Specification | Yes | Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19]. Each ResNet model is trained with a batch size of 256 on a single NVIDIA RTX 4090 GPU. We train the ViT-S models on 4 NVIDIA A40 GPUs and the ViT-B models on 8 NVIDIA V100 (32GB) GPUs.
Software Dependencies | No | The paper mentions 'PyTorch [19]' but does not provide a specific version number. It also mentions 'JAX [20]', likewise without a version.
Experiment Setup | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. We perform training for 100, 200, 400, and 800 epochs. For the ResNet models, we use the Adam and SGD with momentum (0.9) optimizers, both with cosine learning rate schedules; learning rates, due to their significance, will be discussed alongside respective experiments. Further, we use the standard softmax cross-entropy loss. For the ViT, we use Adam with a cosine learning rate scheduler as well as a softmax cross-entropy loss with label smoothing (0.1). The selected ViT is particularly designed for effective training on CIFAR scales and has 7 layers, 12 heads, and hidden sizes of 384. Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19].
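To make the quoted setup concrete, the following is a minimal PyTorch sketch of the CIFAR-10 training configuration as described (Adam, cosine learning-rate schedule, softmax cross-entropy, batch size 128). The learning rate of 1e-3 and the plain torchvision resnet18 are placeholder assumptions: the paper reports learning rates alongside each experiment and uses a CIFAR-scale model, and the TrAct modification of the first layer (as in the sketch above) is omitted here.

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    from torchvision.models import resnet18

    epochs = 200  # the paper sweeps 100, 200, 400, and 800 epochs
    device = "cuda" if torch.cuda.is_available() else "cpu"

    train_loader = DataLoader(
        datasets.CIFAR10("data", train=True, download=True,
                         transform=transforms.ToTensor()),
        batch_size=128, shuffle=True)

    model = resnet18(num_classes=10).to(device)  # stand-in for the CIFAR ResNet-18
    optimizer = optim.Adam(model.parameters(), lr=1e-3)  # lr is a placeholder
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()  # the ViT runs add label_smoothing=0.1

    for epoch in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x.to(device)), y.to(device)).backward()
            optimizer.step()
        scheduler.step()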