SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
Authors: Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Rui Melo, Gabriel Hautreux, Etienne Malaboeuf, Johanne Charpentier, Dominic Culver, Michael Desa
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present an empirical study on the scalability and domain adaptation of LLMs in the legal sector. |
| Researcher Affiliation | Collaboration | Pierre Colombo (Equall, MICS CentraleSupélec), Telmo Pires (Equall), Malik Boudiaf (Equall), Rui Melo, Dominic Culver (Equall), Etienne Malaboeuf (CINES), Gabriel Hautreux (CINES), Johanne Charpentier (CINES), Michael Desa |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We are releasing base, instruct, and aligned versions on top of SaulLM-54B and SaulLM-141B under the MIT License to facilitate reuse and collaborative research. Models will be made available at https://huggingface.co/. |
| Open Datasets | Yes | Our base corpus combines various legal datasets [53] with newly sourced public domain documents. It includes significant collections such as the FreeLaw subset and the MultiLegalPile, augmented with extensive web-scraped content. Table 1 summarizes the composition and scale of our dataset. Instruction Sources: Our methodology for sourcing general instructions involves the integration of a diverse array of datasets... Super Natural Instruction [83] and the FLAN collection [46]... |
| Dataset Splits | No | The paper mentions training, validation, and test phases but does not explicitly provide the dataset splits (e.g., percentages or sample counts) for these phases. It relies on benchmarks for evaluation rather than defining explicit validation splits within its own data processing section. |
| Hardware Specification | Yes | The computational backbone for the continuous pretraining phase of our project consists of 384 AMD MI250 GPUs. We can reach 40% GPU utilization with our implementation. For instruction fine-tuning and preference optimization, we rely on 64 AMD MI250 GPUs. Evaluation protocols are executed on a single node of AMD MI250 GPUs. For synthetic data generation, we used vLLM on a node of NVIDIA A100 GPUs, primarily due to limited support of libraries on MI250. Each node contains four MI250X GPUs, which have a theoretical Thermal Design Power (TDP) of 560W. (A back-of-the-envelope node and power calculation based on these figures appears after the table.) |
| Software Dependencies | No | The paper mentions software components such as PyTorch, DeepSpeed, Flash Attention, the unicodedata Python package, Poppler, and KenLM, but it does not provide specific version numbers for these components. |
| Experiment Setup | Yes | Continued Pretraining: For continued pretraining, we use the AdamW [41, 47, 8] optimizer with hyperparameters β₁ = 0.99, β₂ = 0.90, and a learning rate of 2 × 10⁻⁵. We utilize a cross-entropy loss function to optimize model predictions. The training protocol sets gradient accumulation to 4, with tailored batch sizes of 8 for SaulLM-54B and 4 for SaulLM-141B, optimizing both GPU utilization and training efficiency. Instruction Fine-Tuning (IFT): IFT uses the AdamW optimizer (learning rate of 1 × 10⁻⁵), reinitialized to reset training states and maintain training stability. We limit this phase to a single training epoch, as our experiments suggest this maximizes performance gains. Preference Training Using DPO: We adjust the learning rate of the AdamW optimizer to 1 × 10⁻⁶ during DPO. (An illustrative configuration sketch based on these values appears after the table.) |
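
The hardware figures quoted above imply a node count and theoretical power envelope that are easy to sanity-check. The sketch below is a back-of-the-envelope calculation using only the numbers in the Hardware Specification row (384 MI250 GPUs, four GPUs per node, 560 W TDP per GPU); actual power draw and utilization will differ from the theoretical TDP.

```python
# Back-of-the-envelope check of the quoted pretraining hardware figures.
# Assumption: the 560 W TDP applies to each of the 384 MI250 GPUs, as stated
# in the quoted passage; real power draw will differ from the theoretical TDP.

PRETRAIN_GPUS = 384     # AMD MI250 GPUs used for continued pretraining
GPUS_PER_NODE = 4       # "Each node contains four MI250X GPUs"
TDP_PER_GPU_W = 560     # theoretical Thermal Design Power per GPU, in watts

nodes = PRETRAIN_GPUS // GPUS_PER_NODE                        # 96 nodes
theoretical_power_kw = PRETRAIN_GPUS * TDP_PER_GPU_W / 1000   # 215.04 kW

print(f"{nodes} nodes, ~{theoretical_power_kw:.1f} kW theoretical GPU power")
```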
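
The optimizer settings in the Experiment Setup row map directly onto a standard PyTorch training loop. The following is a minimal illustrative sketch, not the authors' implementation: `model` and `dataloader` are hypothetical placeholders, and the actual training stack relies on DeepSpeed and Flash Attention across many GPUs. It encodes the continued-pretraining values quoted above (AdamW with β₁ = 0.99, β₂ = 0.90, learning rate 2 × 10⁻⁵, cross-entropy loss, gradient accumulation of 4).

```python
import torch
from torch.optim import AdamW

def continued_pretraining_epoch(model, dataloader, device="cuda"):
    """Minimal sketch of the quoted continued-pretraining settings.

    `model` is assumed to return (batch, seq_len, vocab) logits for a batch of
    token ids; `dataloader` yields (input_ids, labels) pairs. Both are
    hypothetical placeholders used for illustration only.
    """
    optimizer = AdamW(
        model.parameters(),
        lr=2e-5,             # learning rate quoted for continued pretraining
        betas=(0.99, 0.90),  # β1, β2 as quoted in the paper
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    grad_accum_steps = 4     # gradient accumulation set to 4

    model.train()
    optimizer.zero_grad()
    for step, (input_ids, labels) in enumerate(dataloader):
        logits = model(input_ids.to(device))
        # Cross-entropy over the flattened (batch * seq_len, vocab) logits.
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       labels.to(device).reshape(-1))
        (loss / grad_accum_steps).backward()
        if (step + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

For the later phases, the quoted setup changes only the optimizer state and learning rate: AdamW is reinitialized with a learning rate of 1 × 10⁻⁵ for instruction fine-tuning (a single epoch) and 1 × 10⁻⁶ for DPO preference training.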