TrAct: Making First-layer Pre-Activations Trainable
Authors: Felix Petersen, Christian Borgelt, Stefano Ermon
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that TrAct (Training Activations) speeds up training by factors between 1.25 and 4 while requiring only a small computational overhead. We demonstrate the utility of TrAct with different optimizers for a range of different vision models including convolutional and transformer architectures. In a wide range of experiments, we demonstrate the utility of the proposed approach, effectively speeding up training by factors ranging from 1.25 to 4. |
| Researcher Affiliation | Academia | Felix Petersen, Stanford University, mail@felix-petersen.de; Christian Borgelt, University of Salzburg, christian@borgelt.net; Stefano Ermon, Stanford University, ermon@cs.stanford.edu |
| Pseudocode | Yes | `def backward(grad_z, x, W, l=0.1): b, n = x.shape; grad_W = grad_z.T @ x @ inverse(x.T @ x / b + l * eye(n)); return grad_W` (Figure 2: Implementation of TrAct, where `l` corresponds to the hyperparameter λ). A runnable PyTorch sketch is given below the table. |
| Open Source Code | Yes | The code is publicly available at github.com/Felix-Petersen/tract. |
| Open Datasets | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. For CIFAR-100, we consider two experimental settings. Finally, we consider training on the ImageNet data set [40]. We fine-tune the ViT-S (800 epoch pre-training) model on the data sets CIFAR-10 and CIFAR-100 [22], Flowers-102 [41], and Stanford Cars [42]. |
| Dataset Splits | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. We perform training for 100, 200, 400, and 800 epochs. For the ResNet models, we use the Adam and SGD with momentum (0.9) optimizers, both with cosine learning rate schedules; learning rates, due to their significance, will be discussed alongside respective experiments. Further, we use the standard softmax cross-entropy loss. For the ViT, we use Adam with a cosine learning rate scheduler as well as a softmax cross-entropy loss with label smoothing (0.1). The selected ViT is particularly designed for effective training on CIFAR scales and has 7 layers, 12 heads, and hidden sizes of 384. Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19]. |
| Hardware Specification | Yes | Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19]. Each ResNet model is trained with a batch size of 256 on a single NVIDIA RTX 4090 GPU. We train the ViT-S models on 4 NVIDIA A40 GPUs and the ViT-B models on 8 NVIDIA V100 (32GB) GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch [19]' but does not provide a specific version number. It also mentions 'JAX [20]' but without a version. |
| Experiment Setup | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. We perform training for 100, 200, 400, and 800 epochs. For the ResNet models, we use the Adam and SGD with momentum (0.9) optimizers, both with cosine learning rate schedules; learning rates, due to their significance, will be discussed alongside respective experiments. Further, we use the standard softmax cross-entropy loss. For the ViT, we use Adam with a cosine learning rate scheduler as well as a softmax cross-entropy loss with label smoothing (0.1). The selected ViT is particularly designed for effective training on CIFAR scales and has 7 layers, 12 heads, and hidden sizes of 384. Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19]. A minimal training-loop sketch of this configuration is also given below the table. |
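As a concrete companion to the Pseudocode row, the following is a minimal PyTorch sketch of the Figure 2 backward pass, wrapped in a custom autograd function. The class name `TrActLinear` and the handling of the input gradient are illustrative assumptions rather than the authors' code; the authoritative implementation is the repository linked above (github.com/Felix-Petersen/tract).

```python
import torch

class TrActLinear(torch.autograd.Function):
    """Sketch of a first-layer linear op z = x @ W.T whose weight gradient
    follows the TrAct rule from Figure 2 (illustrative, not the official code)."""

    @staticmethod
    def forward(ctx, x, W, lam=0.1):
        ctx.save_for_backward(x, W)
        ctx.lam = lam
        return x @ W.T  # first-layer pre-activations

    @staticmethod
    def backward(ctx, grad_z):
        x, W = ctx.saved_tensors
        b, n = x.shape
        # TrAct weight gradient: grad_z^T x (x^T x / b + lam * I)^(-1)
        gram = x.T @ x / b + ctx.lam * torch.eye(n, device=x.device, dtype=x.dtype)
        grad_W = grad_z.T @ x @ torch.linalg.inv(gram)
        grad_x = grad_z @ W  # standard gradient w.r.t. the input
        return grad_x, grad_W, None  # no gradient for lam

# Usage on random data shaped like flattened CIFAR-10 inputs (batch size 128):
x = torch.randn(128, 3 * 32 * 32)
W = torch.randn(64, 3 * 32 * 32, requires_grad=True)
z = TrActLinear.apply(x, W, 0.1)
z.sum().backward()
print(W.grad.shape)  # torch.Size([64, 3072])
```

The only change relative to a standard linear layer is the weight gradient, which is right-multiplied by the inverse of the λ-regularized input Gram matrix computed over the batch.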
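For the Experiment Setup row, a hypothetical training-loop skeleton matching the quoted CIFAR-10 ResNet-18 configuration (SGD with momentum 0.9, cosine learning-rate schedule, softmax cross-entropy loss, batch size 128) might look as follows. The learning rate of 0.1, the plain `ToTensor` preprocessing, and the torchvision `resnet18` stand-in are assumptions; the paper tunes learning rates per experiment and uses its own CIFAR-scale models.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torchvision.models import resnet18

device = "cuda" if torch.cuda.is_available() else "cpu"
epochs, batch_size = 200, 128  # one of the reported epoch budgets

# CIFAR-10 training data (augmentation omitted for brevity)
train_set = datasets.CIFAR10("./data", train=True, download=True,
                             transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size,
                                           shuffle=True, num_workers=4)

model = resnet18(num_classes=10).to(device)  # stand-in for the paper's ResNet-18
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # lr assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss()  # label_smoothing=0.1 is used for the ViT runs

for epoch in range(epochs):
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```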