ReLoRA: High-Rank Training Through Low-Rank Updates
Authors: Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, Anna Rumshisky
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ReLoRA on transformer language models up to 1.3B parameters. Furthermore, ReLoRA achieves similar performance to full-rank training in both upstream and downstream tasks (Table 3). Our results are presented in Table 2 and Figure 1. We conduct ablation studies on all four crucial components of ReLoRA: restarts, jagged schedule, optimizer resets, and warm starts, utilizing the 130M-sized model. The results are presented in Table 6. |
| Researcher Affiliation | Collaboration | University of Massachusetts Lowell; EleutherAI; Amazon |
| Pseudocode | Yes | Algorithm 1 ReLoRA. θ is model parameters, θ̂ is model parameters with linear layers replaced with ReLoRA, M and V are Adam optimizer states, η is the learning rate, and q is the reinit frequency. (A hedged sketch of the merge-and-reinit loop appears after the table.) |
| Open Source Code | Yes | Our code is available on GitHub: github.com/guitaricet/relora |
| Open Datasets | Yes | To evaluate the effectiveness of ReLoRA, we apply it to train a transformer language model on the C4 dataset (Raffel et al., 2020). |
| Dataset Splits | No | The paper uses the C4 dataset and specifies data amounts in tokens for training (e.g., '1.2B', '2.6B', '6.8B', '23.1B' tokens), but it does not provide explicit train/validation/test dataset splits (percentages or counts) or reference a standard splitting methodology required for reproduction. |
| Hardware Specification | Yes | Overall, in the 8x A100 setup, combining the warm start and ReLoRA training time, 1.3B-ReLoRA took 86 hours (wall clock) to train, compared to 93.5 hours to train the 1.3B model full-rank on the same amount of data. We additionally observed that the ReLoRA speedup is significantly hardware-dependent (Table 7). In our early experiments on 2x RTX 3090... In a more practical, but still relatively budget, setup of 6x A6000 Ada... |
| Software Dependencies | No | We use bfloat16 for all floating-point operations and Flash Attention (Dao et al., 2022) for effective attention computation. Unlike plain stochastic gradient descent, the Adam (Kingma and Ba, 2015) update is guided mainly by the first and second moments of the gradient accumulated over the previous steps. |
| Experiment Setup | Yes | Architecture and training hyperparameters: Our architecture is based on the transformer (Vaswani et al., 2017) and closely resembles LLaMA (Touvron et al., 2023). Namely, we use pre-normalization, RMSNorm (Zhang and Sennrich, 2019), SwiGLU activations (Shazeer, 2020), an 8/3 h fully-connected hidden state size (Touvron et al., 2023), and rotary embeddings (Su et al., 2021). We select the number of pre-training tokens based on the Chinchilla scaling laws (Hoffmann et al., 2022). Architecture and training hyperparameters are presented in Table 1. For all LoRA and ReLoRA experiments, we use rank r = 128, as our initial experiments showed it to have the best perplexity/memory trade-off. We initialize ReLoRA from a checkpoint of full-rank training at 5,000 update steps and reset it every 5,000 steps thereafter, 3 times in total, until we reach 20K steps. After each reset, 99% of the optimizer state is pruned based on magnitude, and the loss is warmed up for the next 100 iterations. (The reset and jagged-schedule mechanics are sketched after the table.) |
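
The core step described in the Pseudocode row, merging the low-rank factors into the frozen weight and reinitializing them, can be illustrated with a short PyTorch sketch. This is a minimal reading of Algorithm 1, not the authors' implementation (their GitHub repository linked above is authoritative); the class and method names (`ReLoRALinear`, `merge_and_reinit`) and the init choices are assumptions, with the rank defaulting to the paper's r = 128.

```python
# Minimal PyTorch sketch of a ReLoRA linear layer and its
# merge-and-reinit step (the core of Algorithm 1). Class/method names
# and init choices are assumptions, not the authors' implementation;
# rank defaults to the paper's r = 128.
import math
import torch
import torch.nn as nn

class ReLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 128):
        super().__init__()
        # Frozen full-rank weight W; only the low-rank factors train.
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features), requires_grad=False
        )
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        # Low-rank factors: B starts at zero so the update BA = 0 at (re)init.
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + B(A x): frozen weight plus trainable low-rank update.
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T

    @torch.no_grad()
    def merge_and_reinit(self) -> None:
        # Fold the learned rank-r update into W, then restart A and B.
        # Repeated merges let the accumulated update exceed rank r.
        self.weight += self.lora_B @ self.lora_A
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
```

Because B is re-zeroed after each merge, the layer's function is unchanged at the moment of the restart, and each cycle learns a fresh rank-r direction; the sum of several such updates can have rank well above r, which is the high-rank-through-low-rank idea in the title.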
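
The restart mechanics quoted in the Experiment Setup row (99% magnitude-based pruning of the Adam moments mentioned in the Software Dependencies row, plus a 100-step learning-rate warmup after each reset) can be sketched as well. This assumes a torch.optim.AdamW-style optimizer whose per-parameter state holds `exp_avg` and `exp_avg_sq`; the function names and the multiplicative form of the jagged schedule are illustrative, while the 99%, 100-step, and 5,000-step numbers come from the paper.

```python
# Hedged sketch of the ReLoRA optimizer reset and jagged LR schedule.
# Assumes an AdamW-style optimizer whose state stores 'exp_avg' and
# 'exp_avg_sq' per parameter (true for torch.optim.Adam/AdamW).
import torch

@torch.no_grad()
def prune_optimizer_state(optimizer, lora_params, prune_ratio=0.99):
    """Zero the smallest-magnitude 99% of Adam moments for LoRA params."""
    lora_ids = {id(p) for p in lora_params}
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p)
            if id(p) not in lora_ids or not state:
                continue
            for key in ("exp_avg", "exp_avg_sq"):
                moments = state[key]
                k = int(moments.numel() * prune_ratio)
                if k == 0:
                    continue
                # kthvalue gives the k-th smallest |value|; everything at
                # or below it is zeroed, keeping only the largest 1%.
                cutoff = moments.abs().flatten().kthvalue(k).values
                moments.mul_(moments.abs() > cutoff)

def jagged_lr_scale(step, reset_every=5_000, post_reset_warmup=100):
    """Multiplier applied on top of the base schedule: ramp the LR back
    up from 0 over 100 steps after every merge-and-reinit, producing
    the 'jagged' shape (steps before the warm start are full-rank)."""
    return min(1.0, (step % reset_every) / post_reset_warmup)
```

Pruning the moments rather than zeroing them entirely keeps a trace of the optimizer's scale while preventing the freshly reinitialized factors from being dragged along the previous low-rank trajectory, which is the role the paper's ablations assign to the optimizer reset.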