Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
TrAct: Making First-layer Pre-Activations Trainable
Authors: Felix Petersen, Christian Borgelt, Stefano Ermon
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that TrAct (Training Activations) speeds up training by factors between 1.25 and 4 while requiring only a small computational overhead. We demonstrate the utility of TrAct with different optimizers for a range of different vision models including convolutional and transformer architectures. In a wide range of experiments, we demonstrate the utility of the proposed approach, effectively speeding up training by factors ranging from 1.25 to 4. |
| Researcher Affiliation | Academia | Felix Petersen (Stanford University); Christian Borgelt (University of Salzburg); Stefano Ermon (Stanford University) |
| Pseudocode | Yes | `def backward(grad_z, x, W, l=0.1):` `b, n = x.shape` `grad_W = grad_z.T @ x @ inverse(x.T @ x / b + l * eye(n))` `return grad_W` Figure 2: Implementation of TrAct, where `l` corresponds to the hyperparameter λ. |
| Open Source Code | Yes | The code is publicly available at github.com/Felix-Petersen/tract. |
| Open Datasets | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. For CIFAR-100, we consider two experimental settings. Finally, we consider training on the ImageNet data set [40]. We fine-tune the ViT-S (800 epoch pre-training) model on the data sets CIFAR-10 and CIFAR-100 [22], Flowers-102 [41], and Stanford Cars [42]. |
| Dataset Splits | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. We perform training for 100, 200, 400, and 800 epochs. For the ResNet models, we use the Adam and SGD with momentum (0.9) optimizers, both with cosine learning rate schedules; learning rates, due to their significance, will be discussed alongside respective experiments. Further, we use the standard softmax cross-entropy loss. For the ViT, we use Adam with a cosine learning rate scheduler as well as a softmax cross-entropy loss with label smoothing (0.1). The selected ViT is particularly designed for effective training on CIFAR scales and has 7 layers, 12 heads, and hidden sizes of 384. Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19]. |
| Hardware Specification | Yes | Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19]. Each ResNet model is trained with a batch size of 256 on a single NVIDIA RTX 4090 GPU. We train the ViT-S models on 4 NVIDIA A40 GPUs and the ViT-B models on 8 NVIDIA V100 (32GB) GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch [19]' but does not provide a specific version number. It also mentions 'JAX [20]' but without a version. |
| Experiment Setup | Yes | For the evaluation on the CIFAR-10 data set [22], we consider the ResNet-18 [23] as well as a small ViT model. We perform training for 100, 200, 400, and 800 epochs. For the ResNet models, we use the Adam and SGD with momentum (0.9) optimizers, both with cosine learning rate schedules; learning rates, due to their significance, will be discussed alongside respective experiments. Further, we use the standard softmax cross-entropy loss. For the ViT, we use Adam with a cosine learning rate scheduler as well as a softmax cross-entropy loss with label smoothing (0.1). The selected ViT is particularly designed for effective training on CIFAR scales and has 7 layers, 12 heads, and hidden sizes of 384. Each model is trained with a batch size of 128 on an Nvidia RTX 4090 GPU with PyTorch [19]. |
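
The snippet quoted in the Pseudocode row leaves `inverse` and `eye` undefined, so it does not run as written. The following is a minimal NumPy sketch of the same formula, not the authors' released implementation. Several things here are assumptions: the function name `tract_backward` and the parameter name `lam` are illustrative, `grad_z` is assumed to have shape `(b, m)` to match `x` of shape `(b, n)`, and the unused `W` argument from the quoted signature is omitted. The sketch also swaps the explicit matrix inverse for `np.linalg.solve`, which gives the same result because the regularized matrix is symmetric.

```python
import numpy as np


def tract_backward(grad_z, x, lam=0.1):
    """Sketch of the first-layer weight gradient from the quoted pseudocode.

    grad_z: gradient w.r.t. first-layer pre-activations, assumed shape (b, m)
    x:      first-layer inputs, shape (b, n)
    lam:    regularization strength (the hyperparameter lambda)
    Returns an array of shape (m, n).
    """
    b, n = x.shape
    # Regularized second-moment matrix of the inputs, shape (n, n).
    gram = x.T @ x / b + lam * np.eye(n)
    # Target: grad_z.T @ x @ inv(gram).
    # gram is symmetric, so this equals solve(gram, x.T @ grad_z).T,
    # which avoids forming the inverse explicitly.
    return np.linalg.solve(gram, x.T @ grad_z).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 5))       # batch of 8 inputs with 5 features
    grad_z = rng.standard_normal((8, 3))  # gradients for 3 output units
    print(tract_backward(grad_z, x).shape)
```

With `lam > 0` the matrix `gram` is positive definite, so the linear solve is well posed even when the batch size is smaller than the input dimension.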