Geometry-aware training of factorized layers in tensor Tucker format

Authors: Emanuele Zangrando, Steffen Schotthöfer, Gianluca Ceruti, Jonas Kusch, Francesco Tudisco

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The method's performance is further illustrated through a variety of experiments, showing remarkable training compression rates and comparable or even better performance than the full baseline and alternative layer factorization strategies. In the following, we conduct a series of experiments to evaluate the performance of the proposed method as compared to the full model and to standard layer factorization and model pruning baselines.
Researcher Affiliation | Academia | Emanuele Zangrando, School of Mathematics, Gran Sasso Science Institute, L'Aquila, Italy (emanuele.zangrando@gssi.it); Steffen Schotthöfer, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA (schotthoefers@ornl.gov); Gianluca Ceruti, Department of Mathematics, University of Innsbruck, Innsbruck, Austria (gianluca.ceruti@uibk.ac.at); Jonas Kusch, Department of Data Science, Norwegian University of Life Sciences, Ås, Norway (jonas.kusch@nmbu.no); Francesco Tudisco, School of Mathematics and Maxwell Institute, University of Edinburgh, Edinburgh, UK, and School of Mathematics, Gran Sasso Science Institute, L'Aquila, Italy (f.tudisco@ed.ac.uk)
Pseudocode | Yes | Algorithm 1: TDLRT: Efficient Tensor Dynamical Low-Rank Training in Tucker format. Algorithm 2: TDLRT: Standard Dynamical Low-Rank Training of convolutions in Tucker format.
Open Source Code | Yes | The code is available in the supplementary material.
Open Datasets | Yes | The compression performance of TDLRT is evaluated on CIFAR10 and tiny-imagenet. We show in Figure 2 the accuracy history of LeNet5 on MNIST using TDLRT as compared to standard training on Tucker and CP decompositions.
Dataset Splits | Yes | All methods are trained using a batch size of 128 for 70 epochs each, as done in [79, 36]. All the baseline methods are trained with the SGD optimizer; the starting learning rate of 0.05 is reduced by a factor of 10 on plateaus, and momentum is chosen as 0.1 for all layers.
Hardware Specification | Yes | The experiments are performed on an Nvidia RTX3090, an Nvidia RTX3070, and one Nvidia A100 80GB.
Software Dependencies | No | No specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow) are mentioned; only the type of optimizer (SGD) is specified.
Experiment Setup | Yes | All methods are trained using a batch size of 128 for 70 epochs each, as done in [79, 36]. All the baseline methods are trained with the SGD optimizer; the starting learning rate of 0.05 is reduced by a factor of 10 on plateaus, and momentum is chosen as 0.1 for all layers.
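
For context on the Tucker-format layers named in the Pseudocode row, the following is a minimal PyTorch sketch of a convolution whose kernel is stored as a Tucker core plus one factor matrix per mode and reconstructed on the fly in the forward pass. It is an illustration only, not the authors' TDLRT algorithm (which additionally evolves the factors along the low-rank manifold during training); the class name TuckerConv2d, the ranks, and the initialization scale are hypothetical choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TuckerConv2d(nn.Module):
    """Convolution whose kernel is stored in Tucker format: a core plus four factor matrices."""

    def __init__(self, in_ch, out_ch, kernel_size, ranks):
        super().__init__()
        r_out, r_in, r_h, r_w = ranks
        self.padding = kernel_size // 2
        # Tucker core and one factor per kernel mode:
        # (output channels, input channels, kernel height, kernel width).
        self.core = nn.Parameter(0.02 * torch.randn(r_out, r_in, r_h, r_w))
        self.U_out = nn.Parameter(0.02 * torch.randn(out_ch, r_out))
        self.U_in = nn.Parameter(0.02 * torch.randn(in_ch, r_in))
        self.U_h = nn.Parameter(0.02 * torch.randn(kernel_size, r_h))
        self.U_w = nn.Parameter(0.02 * torch.randn(kernel_size, r_w))

    def forward(self, x):
        # Reconstruct the full kernel W = core x_1 U_out x_2 U_in x_3 U_h x_4 U_w,
        # then apply a standard convolution with it.
        weight = torch.einsum(
            "abcd,oa,ib,hc,wd->oihw",
            self.core, self.U_out, self.U_in, self.U_h, self.U_w,
        )
        return F.conv2d(x, weight, padding=self.padding)

layer = TuckerConv2d(in_ch=64, out_ch=128, kernel_size=3, ranks=(16, 16, 3, 3))
out = layer(torch.randn(8, 64, 32, 32))  # output shape: (8, 128, 32, 32)

Storing only the core and factor matrices instead of the full kernel is what yields the compression; the paper's contribution concerns how these factors are updated during training in a geometry-aware way.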
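
The Dataset Splits and Experiment Setup rows quote the same training configuration (batch size 128, 70 epochs, SGD with starting learning rate 0.05 and momentum 0.1, learning rate reduced by a factor of 10 on plateaus). Below is a rough PyTorch/torchvision sketch of that configuration on CIFAR10; the backbone model and the metric passed to the plateau scheduler are placeholders, not the authors' exact pipeline.

import torch
import torchvision
import torchvision.transforms as T

# CIFAR10 with the quoted batch size of 128.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torch.nn.Sequential(  # placeholder backbone, not the paper's architecture
    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(32, 10),
)
criterion = torch.nn.CrossEntropyLoss()
# SGD with starting learning rate 0.05 and momentum 0.1, as quoted above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.1)
# Reduce the learning rate by a factor of 10 when the tracked metric plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

for epoch in range(70):  # 70 epochs, as quoted above
    epoch_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)  # the plateau metric used here is a placeholder choice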