VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections
Authors: Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset. |
| Researcher Affiliation | Industry | Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng (Huawei Noah's Ark Lab). Corresponding authors: roy.miles@huawei.com, ismail.elezi@huawei.com |
| Pseudocode | Yes | Algorithm 1: VeLoRA, PyTorch-like pseudocode (see the hedged sketch below the table). |
| Open Source Code | Yes | Code: https://github.com/roymiles/VeLoRA |
| Open Datasets | Yes | VTAB-1k [53], GLUE [48], Alpaca dataset [36], C4 dataset [34] |
| Dataset Splits | No | The paper mentions using various benchmarks (VTAB-1k, GLUE, MMLU, C4) and reporting validation perplexity, but does not explicitly provide details about specific training, validation, and test dataset splits used for reproduction, nor does it cite standard splits being used. |
| Hardware Specification | Yes | All of the experiments in sections 4.2 and 4.5 were performed using 8 NVIDIA V100 GPUs with the fp16 data type. For the LLaMA experiments in section 4.4, we trained on 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch ("PyTorch-like pseudocode") and building upon other repositories (alpaca-lora, GaLore), but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We finetuned a ViT-B [12] model pre-trained on ImageNet-21K using the AdamW optimizer with a learning rate of 5e-4 and a weight decay of 1e-4. All our models were trained using the AdamW optimizer with a learning rate of 1e-3 and a weight decay of 0. Table 8: Hyperparameters of fine-tuning RoBERTa base. (The quoted optimizer settings are sketched after the table.) |
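
The Pseudocode row notes that the paper provides PyTorch-like pseudocode (Algorithm 1). The snippet below is a minimal, hedged sketch of the rank-1 sub-token projection idea as described by the title and abstract: the input activation of a linear layer is split into sub-tokens, each sub-token is compressed to a single coefficient against a fixed rank-1 vector, and only those coefficients are stored and used to coarsely reconstruct the activation for the weight gradient in the backward pass. The class name `VeLoRALinearFn`, the `group_size` argument, and the initialisation of `v` are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of rank-1 sub-token projection for memory-efficient training.
# Assumed names: VeLoRALinearFn, group_size, v (not taken from the paper's code).
import torch


class VeLoRALinearFn(torch.autograd.Function):
    """Linear layer that stores a rank-1 compressed copy of its input
    for the backward pass instead of the full activation tensor."""

    @staticmethod
    def forward(ctx, x, weight, v, group_size):
        # x: (batch, tokens, dim); split the feature dimension into sub-tokens.
        b, t, d = x.shape
        sub = x.reshape(b, t, d // group_size, group_size)
        # Project each sub-token onto the fixed rank-1 vector v of shape (group_size,).
        coeffs = sub @ v                                # (b, t, d // group_size)
        ctx.save_for_backward(coeffs, weight, v)        # coeffs replace the full x
        ctx.shape = (b, t, d, group_size)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        b, t, d, group_size = ctx.shape
        # Coarsely reconstruct the input from the stored coefficients.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(b, t, d)
        grad_x = grad_out @ weight
        grad_w = grad_out.reshape(-1, grad_out.shape[-1]).t() @ x_hat.reshape(-1, d)
        return grad_x, grad_w, None, None


# Usage sketch: v could, for example, be initialised from an average of sub-tokens.
x = torch.randn(2, 16, 64, requires_grad=True)
w = torch.randn(32, 64, requires_grad=True)
v = torch.nn.functional.normalize(torch.randn(8), dim=0)
y = VeLoRALinearFn.apply(x, w, v, 8)
y.sum().backward()
```

Only the per-group coefficients are kept between the forward and backward passes, which is where the activation-memory saving comes from; the weight gradient is then computed against the reconstructed activations.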
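For the Experiment Setup row, the following is a minimal sketch of the quoted optimizer settings. The model object is a placeholder stand-in; only the AdamW learning rates and weight decays come from the quoted text.

```python
import torch

# Placeholder model; the paper fine-tunes a ViT-B pre-trained on ImageNet-21K.
model = torch.nn.Linear(768, 100)

# VTAB-1k fine-tuning setting quoted above: AdamW, lr 5e-4, weight decay 1e-4.
vtab_optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)

# Other quoted setting: AdamW, lr 1e-3, weight decay 0.
other_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
```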