CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization
Authors: Zi Yang, Ziyue Liu, Samridhi Choudhary, Xinfeng Xie, Cao Gao, Siegfried Kunzmann, Zheng Zhang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method also shows 2× speedup over standard pre-training on a BERT-like code-generation LLM while achieving a 4.23× compression ratio in pre-training. An implementation of CoMERA is available at https://github.com/ziyangjoy/CoMERA. In this section, we test the performance of CoMERA on a few benchmarks, including a domain-specific LLM. Our experiments are run on an Nvidia RTX 3090 GPU with 24GB RAM. |
| Researcher Affiliation | Collaboration | Zi Yang, University at Albany, SUNY (zyang8@albany.edu); Ziyue Liu, University of California at Santa Barbara (ziyueliu@ucsb.edu); Samridhi Choudhary, Amazon Alexa AI (samridhc@amazon.com); Xinfeng Xie, Meta (xinfeng@meta.com); Cao Gao, Meta (caogao@meta.com); Siegfried Kunzmann, Amazon Alexa AI (kunzman@amazon.com); Zheng Zhang, University of California at Santa Barbara (zhengzhang@ece.ucsb.edu) |
| Pseudocode | Yes | A.3 Algorithm for Late Stage Optimization in Section 3.2: The algorithm for the late-stage optimization in Section 3.2 is summarized in Algorithm 1. A.5 Algorithm for Contraction Path in Section 4.2: The empirical near-optimal contraction path for tensor-compressed training is shown in Algorithm 2. (An illustrative einsum-based tensorized-layer sketch appears after this table.) |
| Open Source Code | Yes | An implementation of CoMERA is available at https://github.com/ziyangjoy/CoMERA. |
| Open Datasets | Yes | We train this model on the MNLI dataset [38] with the maximum sequence length 128 and compare the accuracy, resulting model size, and training time of CoMERA with the standard uncompressed training. We further test CoMERA on DLRM [32] released by Meta on the Criteo Ad Kaggle dataset [23]. The pre-training dataset is the CodeSearchNet [21], a collection of 2M (comment, code) pairs and 6M pure code sequences from open-source libraries with 6 types of programming languages. |
| Dataset Splits | No | The paper mentions 'validation accuracy' and shows 'validation total size (MB)' in Table 1 and Figure 6, indicating the use of a validation set. However, it does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages or sample counts), which is necessary for reproducibility. |
| Hardware Specification | Yes | Our experiments are run on an Nvidia RTX 3090 GPU with 24GB RAM. We evaluate the mixed-precision forward and backward propagations of CoMERA in FP8 precision on the NVIDIA L4 GPU. |
| Software Dependencies | No | The paper mentions software components such as 'PyTorch einsum', 'CUDA Graph', and 'FP8 precision' but does not specify their version numbers, which is necessary for reproducible software dependencies. (A hedged CUDA Graph capture sketch appears after this table.) |
| Experiment Setup | Yes | We train this model on the MNLI dataset [38] with the maximum sequence length 128 and compare the accuracy, resulting model size, and training time of CoMERA with the standard uncompressed training. Table 1: Result of Transformer on MNLI with batch size 128. Table 4: Tensorized setting for the Transformer model in CoMERA. The model is trained for two epochs. (A hedged MNLI data-loading sketch appears after this table.) |
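
The Pseudocode row refers to the paper's einsum-based contraction path (Algorithm 2) and the Experiment Setup row to a tensorized Transformer (Table 4). The paper's exact factorization, ranks, and rank-adaptive updates are not reproduced here; the following is a minimal PyTorch sketch of a tensor-train (TT) factorized linear layer contracted with `torch.einsum`. The class name `TTLinear`, the mode shapes `(16, 16, 3)`, and the ranks `(1, 8, 8, 1)` are illustrative assumptions, not values from the paper, and the contraction order is fixed rather than optimized.

```python
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    """Illustrative tensor-train (TT) factorized linear layer (3 cores, hard-coded).

    The full (prod(in_modes) x prod(out_modes)) weight matrix is never formed;
    the forward pass contracts the reshaped input with small TT cores via
    torch.einsum, the kind of contraction a tensor-compressed layer performs.
    """

    def __init__(self, in_modes=(16, 16, 3), out_modes=(16, 16, 3), ranks=(1, 8, 8, 1)):
        super().__init__()
        self.in_modes, self.out_modes = in_modes, out_modes
        # One core per mode, shaped (r_{k-1}, in_k, out_k, r_k) with r_0 = r_3 = 1.
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k], out_modes[k], ranks[k + 1]))
            for k in range(3)
        ])

    def forward(self, x):
        b = x.shape[0]
        i1, i2, i3 = self.in_modes
        x = x.reshape(b, i1, i2, i3)                  # split the feature dim into modes
        g1, g2, g3 = self.cores
        # Absorb one input mode per step; the rank index carries information forward.
        t = torch.einsum('bxyz,axpr->byzpr', x, g1)   # -> (b, i2, i3, o1, r1)
        t = torch.einsum('byzpr,ryqs->bzpqs', t, g2)  # -> (b, i3, o1, o2, r2)
        t = torch.einsum('bzpqs,szwc->bpqw', t, g3)   # -> (b, o1, o2, o3)
        return t.reshape(b, -1)


# Usage: 768-dim features factored as 16*16*3 on both the input and output side.
layer = TTLinear()
y = layer(torch.randn(4, 16 * 16 * 3))
print(y.shape)  # torch.Size([4, 768])
```

With these toy settings the layer stores roughly 18.5K parameters instead of the 589,824 of a dense 768×768 weight, which is the kind of compression the tensorized setting in Table 4 targets.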
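The Dataset Splits row notes that the paper does not spell out how MNLI is divided. A common reproduction choice, not stated in the paper, is the standard GLUE MNLI splits (train / validation_matched / validation_mismatched). The sketch below assumes the Hugging Face `datasets` and `transformers` libraries and the `bert-base-uncased` tokenizer; only the maximum sequence length of 128 and the batch size of 128 come from the quoted setup.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

# Standard GLUE MNLI splits: train / validation_matched / validation_mismatched.
raw = load_dataset("glue", "mnli")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer


def encode(batch):
    # Pair premise and hypothesis, truncate/pad to the paper's max length of 128.
    return tok(batch["premise"], batch["hypothesis"],
               truncation=True, padding="max_length", max_length=128)


encoded = raw.map(encode, batched=True)
encoded.set_format(type="torch",
                   columns=["input_ids", "attention_mask", "token_type_ids", "label"])

# Batch size 128, as in Table 1 of the paper.
train_loader = DataLoader(encoded["train"], batch_size=128, shuffle=True)
val_loader = DataLoader(encoded["validation_matched"], batch_size=128)
```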
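The Software Dependencies row cites 'CUDA Graph' without a version. For reference, PyTorch has exposed CUDA graph capture since version 1.10 through `torch.cuda.CUDAGraph` and the `torch.cuda.graph` context manager. The sketch below captures and replays a single forward pass; the toy model, shapes, and warm-up count are assumptions and do not reproduce the paper's training loop.

```python
import torch

# Hypothetical static model and input; CUDA graph capture requires a CUDA device
# and fixed tensor shapes.
model = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.GELU()).cuda()
static_input = torch.randn(128, 768, device="cuda")

# Warm up on a side stream so capture sees a steady-state allocator.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph, then replay it on new data in-place.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

static_input.copy_(torch.randn(128, 768, device="cuda"))
graph.replay()               # re-runs the captured kernels
print(static_output.shape)   # results are written into static_output
```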