Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Authors: Samuel Horváth, Stefanos Laskaridis, Shashank Rajput, Hongyi Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to DNNs, MAESTRO enables the extraction of lower-footprint models that preserve performance. We validate these claims in Sec. 5.2 and 5.5, respectively. We start this section by describing the setup of our experiments, including the models, datasets, and baselines with which we compare MAESTRO. (See the illustrative low-rank sketch after the table.) |
| Researcher Affiliation | Collaboration | 1 Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE; 2 Brave Software, London, UK; 3 Databricks, San Francisco, USA; 4 Carnegie Mellon University, Pittsburgh, USA. |
| Pseudocode | Yes | Algorithm 1: MAESTRO (Training Process); Algorithm 2: MAESTRO (Hyper-parameter optimization) |
| Open Source Code | Yes | The implementation can be found here: https://github.com/SamuelHorvath/Maestro-LoD |
| Open Datasets | Yes | MNIST. The MNIST dataset (LeCun et al., 2010) is a database of 28×28 greyscale handwritten digits, with a training set of 60k examples and a test set of 10k samples. CIFAR-10. The CIFAR-10 dataset (Krizhevsky et al., 2009) is a computer vision dataset that consists of 32×32 RGB images classified into 10 labels. It is split into 50k training images and 10k test images, balanced across labels. ImageNet-1k. The ImageNet dataset (ILSVRC) (Deng et al., 2009) is an image classification challenge. The task is to classify a 300×300 RGB image among 1,000 classes. In total there are 1.2M training samples and 50k test images. WMT16. The WMT dataset from statmt is a machine translation dataset, spanning news commentaries and parliament proceedings, that aims to investigate the applicability of machine translation techniques when translating between language pairs. Specifically, we focus on German-English translation of image descriptions, commonly referred to as Multi30k (Elliott et al., 2016). (See the data-loading sketch after the table.) |
| Dataset Splits | No | The paper provides train and test set sizes for MNIST, CIFAR-10, and ImageNet, but does not explicitly state percentages or counts for a separate validation split for reproduction. For the Transformer model, the reported perplexity implies a validation set, but the specific split details are not provided. |
| Hardware Specification | Yes | We have implemented our solution in PyTorch (Paszke et al., 2017) (v1.13.0) and trained our models on NVIDIA A100 (40GB) GPUs. |
| Software Dependencies | Yes | We have implemented our solution in PyTorch (Paszke et al., 2017) (v1.13.0) and trained our models on NVIDIA A100 (40GB) GPUs. |
| Experiment Setup | Yes | LeNet. We use a standard configuration commonly employed for training LeNet models: a step size of 0.01, a momentum of 0.9, and no weight decay. We train for a total of 20 epochs. VGG and ResNet-18. Similarly, we use a standard configuration commonly employed for training VGG and ResNet-18 models: a step size of 0.01, a momentum of 0.9, a weight decay of 1e-4, and a learning-rate schedule with reductions by a factor of 10 at epochs 150 and 250. We train for a total of 300 epochs. ResNet-50. Similarly, we use a standard configuration commonly employed for training ResNet-50 models: a step size of 0.01, a momentum of 0.9, a weight decay of 1e-4, and a learning-rate schedule with reductions by a factor of 10 at epochs 30 and 60. We train for a total of 90 epochs. Transformers. For the Transformer model, we use the Adam optimizer with an initial learning rate of 0.001, βs = (0.9, 0.98), ε = 1e-8, and a batch size of 256. We also apply gradient-norm clipping with a norm bound of 0.25. The entire training takes 400 epochs. For the vanilla warm-up training, we use warm-up epochs E_wu = 10. We enable label smoothing, weight sharing between the source and target word embeddings, and weight sharing between the target word embedding and the last dense layer. The learning-rate schedule follows directly from the one proposed by Vaswani et al. (2017). (See the training-setup sketch after the table.) |
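To make the "lower-footprint model" claim in the Research Type row concrete, the snippet below shows a generic low-rank factorization of a linear layer (W ≈ UV). This is an illustrative sketch only, not MAESTRO's Algorithm 1 or 2: the `LowRankLinear` class, dimensions, and rank are placeholders we introduce here, and the paper's trainable, ordered decomposition is defined in the paper and the linked repository.

```python
# Illustrative sketch only: a generic low-rank factorization of a linear map,
# W ≈ U V, showing how a fixed rank shrinks the parameter count. This is NOT
# the paper's Algorithm 1/2; class name, sizes, and rank are our placeholders.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A d_out x d_in linear map parameterized as the product of two thin factors."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.V = nn.Linear(d_in, rank, bias=False)  # r x d_in factor
        self.U = nn.Linear(rank, d_out, bias=True)  # d_out x r factor (plus bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.U(self.V(x))

full = nn.Linear(1024, 1024)                    # ~1.05M parameters
low_rank = LowRankLinear(1024, 1024, rank=64)   # ~0.13M parameters at rank 64
x = torch.randn(8, 1024)
print(low_rank(x).shape)  # torch.Size([8, 1024])
```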
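A minimal data-loading sketch for the two smaller image datasets listed under Open Datasets, assuming standard torchvision loaders. The normalization constants, augmentations, root path, and batch size are common defaults rather than values taken from the paper.

```python
# Hedged sketch: loading MNIST and CIFAR-10 with torchvision. Transforms and
# normalization statistics are common defaults (assumptions), not paper values.
import torch
from torchvision import datasets, transforms

def mnist_loaders(root="./data", batch_size=128):
    # MNIST: 28x28 greyscale digits, 60k train / 10k test (as stated in the paper).
    tfm = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),  # common MNIST stats (assumption)
    ])
    train = datasets.MNIST(root, train=True, download=True, transform=tfm)
    test = datasets.MNIST(root, train=False, download=True, transform=tfm)
    return (torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True),
            torch.utils.data.DataLoader(test, batch_size=batch_size))

def cifar10_loaders(root="./data", batch_size=128):
    # CIFAR-10: 32x32 RGB images, 50k train / 10k test (as stated in the paper).
    tfm = transforms.Compose([
        transforms.RandomCrop(32, padding=4),        # typical augmentation (assumption)
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465),
                             (0.2470, 0.2435, 0.2616)),  # common CIFAR-10 stats (assumption)
    ])
    train = datasets.CIFAR10(root, train=True, download=True, transform=tfm)
    test = datasets.CIFAR10(root, train=False, download=True,
                            transform=transforms.ToTensor())
    return (torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True),
            torch.utils.data.DataLoader(test, batch_size=batch_size))
```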
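A minimal training-setup sketch reproducing the VGG/ResNet-18 and Transformer optimizer settings quoted under Experiment Setup, assuming plain PyTorch (SGD with a MultiStepLR schedule, Adam with gradient-norm clipping). The model choices and the loop body are placeholders; this is not the authors' released training script.

```python
# Hedged sketch of the optimizer/schedule hyper-parameters quoted in the table.
# Placeholders: the torchvision ResNet-18 and nn.Transformer stand in for the
# paper's models; cifar10_loaders() refers to the data-loading sketch above.
import torch
import torch.nn as nn
from torchvision import models

# VGG / ResNet-18 on CIFAR-10: SGD, lr 0.01, momentum 0.9, weight decay 1e-4,
# x10 decay at epochs 150 and 250, 300 epochs in total.
model = models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)
criterion = nn.CrossEntropyLoss()

train_loader, _ = cifar10_loaders()  # from the data-loading sketch above
for epoch in range(300):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # applies the factor-10 reductions at the milestone epochs

# Transformer on Multi30k: Adam, lr 1e-3, betas (0.9, 0.98), eps 1e-8,
# gradient-norm clipping at 0.25; nn.Transformer is a stand-in module.
transformer = nn.Transformer()
adam = torch.optim.Adam(transformer.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-8)
# Inside the Transformer training loop, after loss.backward():
#     torch.nn.utils.clip_grad_norm_(transformer.parameters(), max_norm=0.25)
```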