CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization

Authors: Zi Yang, Ziyue Liu, Samridhi Choudhary, Xinfeng Xie, Cao Gao, Siegfried Kunzmann, Zheng Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method also shows a 2× speedup over standard pre-training on a BERT-like code-generation LLM while achieving a 4.23× compression ratio in pre-training. An implementation of CoMERA is available at https://github.com/ziyangjoy/CoMERA. In this section, we test the performance of CoMERA on a few benchmarks, including a domain-specific LLM. Our experiments are run on an NVIDIA RTX 3090 GPU with 24GB RAM.
Researcher Affiliation | Collaboration | Zi Yang (University at Albany, SUNY) zyang8@albany.edu; Ziyue Liu (University of California at Santa Barbara) ziyueliu@ucsb.edu; Samridhi Choudhary (Amazon Alexa AI) samridhc@amazon.com; Xinfeng Xie (Meta) xinfeng@meta.com; Cao Gao (Meta) caogao@meta.com; Siegfried Kunzmann (Amazon Alexa AI) kunzman@amazon.com; Zheng Zhang (University of California at Santa Barbara) zhengzhang@ece.ucsb.edu
Pseudocode | Yes | A.3 Algorithm for Late Stage Optimization in Section 3.2: The algorithm for the late-stage optimization in Section 3.2 is summarized in Algorithm 1. A.5 Algorithm for Contraction Path in Section 4.2: The empirical near-optimal contraction path for tensor-compressed training is shown in Algorithm 2. (See the contraction sketch after the table.)
Open Source Code | Yes | An implementation of CoMERA is available at https://github.com/ziyangjoy/CoMERA.
Open Datasets | Yes | We train this model on the MNLI dataset [38] with a maximum sequence length of 128 and compare the accuracy, resulting model size, and training time of CoMERA with standard uncompressed training. We further test CoMERA on DLRM [32], released by Meta, on the Criteo Ad Kaggle dataset [23]. The pre-training dataset is CodeSearchNet [21], a collection of 2M (comment, code) pairs and 6M pure code sequences from open-source libraries in 6 programming languages. (See the dataset-loading sketch after the table.)
Dataset Splits | No | The paper mentions 'validation accuracy' and shows 'validation total size (MB)' in Table 1 and Figure 6, indicating the use of a validation set. However, it does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages or sample counts), which is necessary for reproducibility.
Hardware Specification | Yes | Our experiments are run on an NVIDIA RTX 3090 GPU with 24GB RAM. We evaluate the mixed-precision forward and backward propagations of CoMERA in FP8 precision on the NVIDIA L4 GPU.
Software Dependencies | No | The paper mentions software components such as 'PyTorch einsum', 'CUDA Graph', and 'FP8 precision' but does not specify their version numbers, which are needed for reproducible software dependencies. (See the environment-check sketch after the table.)
Experiment Setup | Yes | We train this model on the MNLI dataset [38] with a maximum sequence length of 128 and compare the accuracy, resulting model size, and training time of CoMERA with standard uncompressed training. Table 1: Result of Transformer on MNLI with batch size 128. Table 4: Tensorized setting for the Transformer model in CoMERA. The model is trained for two epochs. (See the training-loop sketch after the table.)
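
The paper's Algorithm 2 concerns the order in which einsum contractions are executed during tensor-compressed training. As a rough illustration of what a contraction path controls (not the paper's algorithm), here is a minimal sketch of a two-core tensor-train (TT) matrix product contracted in one particular order; the core shapes, index names, and the chosen order are illustrative assumptions.

```python
import torch

# Hypothetical TT-matrix factors for a weight of shape (m1*m2) x (n1*n2).
# These names and sizes are illustrative, not taken from the paper.
m1, m2, n1, n2, r = 16, 16, 16, 16, 8
G1 = torch.randn(m1, n1, r)    # first TT core
G2 = torch.randn(r, m2, n2)    # second TT core
x  = torch.randn(128, n1, n2)  # a batch of inputs reshaped to (batch, n1, n2)

# One possible contraction order: contract the input with G2 first, then G1.
# The size of the intermediate `tmp` depends entirely on this choice, which is
# exactly the cost a near-optimal contraction path tries to minimize.
tmp = torch.einsum('rbj,Bij->Bbri', G2, x)    # (batch, m2, r, n1)
y   = torch.einsum('air,Bbri->Bab', G1, tmp)  # (batch, m1, m2)
print(y.shape)  # torch.Size([128, 16, 16])
```
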
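All three quoted datasets are publicly available. Below is a minimal sketch of loading two of them with the Hugging Face `datasets` library; the Hub identifiers ("glue"/"mnli", "code_search_net") are the standard names at the time of writing and are assumed here rather than taken from the paper, and the Criteo Ad Kaggle data is normally obtained separately.

```python
from datasets import load_dataset

# MNLI ships as part of GLUE; its standard splits include train,
# validation_matched, and validation_mismatched.
mnli = load_dataset("glue", "mnli")
print({name: len(split) for name, split in mnli.items()})

# CodeSearchNet: (comment, code) pairs across six programming languages.
# The identifier below is an assumption based on the public dataset card.
csn = load_dataset("code_search_net", "all")

# The Criteo Ad Kaggle dataset used for DLRM is distributed by Criteo/Kaggle
# and is usually downloaded with the scripts in Meta's DLRM repository.
```
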
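Because the table flags missing version information, a reproducer would at least want to record the software stack alongside the reported hardware. A minimal sketch of such a check follows; the FP8 dtype probe assumes a recent PyTorch (>= 2.1), and none of these versions come from the paper.

```python
import torch

# Record the software/hardware environment the paper leaves unspecified.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)

# FP8 tensors (torch.float8_e4m3fn / torch.float8_e5m2) exist only in recent
# PyTorch releases; this check is an assumption, not from the paper.
print("FP8 e4m3 available:", hasattr(torch, "float8_e4m3fn"))
```
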
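The quoted Transformer-on-MNLI setup pins down only a few hyperparameters (maximum sequence length 128, batch size 128, two epochs). The skeleton below uses exactly those values; the model, the tokenizing collate function, the optimizer, and the learning rate are hypothetical placeholders, not values reported in the quote.

```python
import torch
from torch.utils.data import DataLoader

# Values quoted in the paper's setup.
MAX_SEQ_LEN, BATCH_SIZE, NUM_EPOCHS = 128, 128, 2

def train_mnli(model, train_set, collate):
    """Train a 3-way MNLI classifier with the quoted hyperparameters.

    `model` and `collate` (which tokenizes/pads a batch to MAX_SEQ_LEN and
    returns input ids, attention mask, labels) are placeholders.
    """
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True,
                        collate_fn=lambda b: collate(b, max_length=MAX_SEQ_LEN))
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder optimizer/LR
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(NUM_EPOCHS):
        for input_ids, attention_mask, labels in loader:
            optim.zero_grad()
            logits = model(input_ids, attention_mask)  # (batch, 3) entailment logits
            loss_fn(logits, labels).backward()
            optim.step()
```
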