SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

Authors: Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that the parametric LM struggles on its own with domains not covered by OLC. However, access to the datastore greatly improves out-of-domain performance, closing 90% of the performance gap with an LM trained on the Pile, a more diverse corpus with mostly high-risk text. We also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. Our results suggest that it is possible to build high-quality language models while mitigating legal risk.
Researcher Affiliation | Academia | 1 University of Washington, 2 UC Berkeley, 3 Allen Institute for AI; {sewon,sg01,hannaneh,nasmith,lsz}@cs.washington.edu, ericwallace@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper references various existing open-source projects and datasets, but does not state that the code for the proposed model, SILO, is open-source or publicly available.
Open Datasets | Yes | For the parametric component of SILO, we use a 1.3B decoder-only transformer LM (Vaswani et al., 2017) based on the LLaMA architecture (Touvron et al., 2023) as implemented in OpenLM. This model uses a fixed set of parameters at both training and inference time. The parametric component of SILO should be learned on the lower-risk data, i.e., data in the public domain or under permissive licenses. To this end, we introduce the OPEN LICENSE CORPUS (OLC), a large collection of permissive textual datasets across multiple domains, comprising 228B tokens. It is essential to note that the permissivity of a license type is not a binary decision but rather a spectrum. Therefore, we establish three levels of permissive licenses: Public domain (pd), Permissively licensed software (sw), and Attribution licenses (by). This allows model developers to select the boundary that aligns with their preferences. See §B.1 for the description of each license type. We train the models on varying subsets of licenses, from pd and pdsw to pdswby, to accommodate different risk tolerances. Later sections demonstrate our empirical findings are consistent across all possible boundaries (§4, §D.2, Figure 3). The domains and license levels of OLC are:
pd/by Legal: We curate legal text from the Pile of Law (Henderson et al., 2022), an amalgamation of 31 different sources of text related to civil court cases, patents, and other legal and governmental works, either licensed as public domain or CC-BY. We also gather public domain text from the Case Law Access Project (Caselaw Access Project), which covers over 6.5 million decisions published by state and federal courts throughout U.S. history.
sw Code: We use the Github subset of the RedPajama dataset (Together, 2023), which contains code from Github repositories with three permissive software licenses: MIT, Apache, and BSD.
sw/by Conversation: We source conversational text under permissive software licenses from the Hacker News (MIT license) and the Ubuntu IRC (Apache license) subsets of the Pile (Gao et al., 2020). We also use the Stackexchange subset of the RedPajama dataset (Together, 2023) and a Stackoverflow corpus from Kaggle, both under the CC-BY-SA license.
sw Math: We source mathematical text from the DeepMind Mathematics (Saxton et al., 2019) and the AMPS (Hendrycks et al., 2021) datasets, both of which are under the Apache license.
pd/by Science: We source scientific text from arXiv abstracts that are in the public domain (arXiv, 2023). We also collect full-text articles from the Semantic Scholar Research Corpus (Lo et al., 2020, S2ORC), either licensed as public domain or CC-BY.
pd Books: We source books from the Gutenberg corpus (Project Gutenberg), which are copyright-expired books in the public domain.
pd/by News: We collect public domain news text from the English subset of the MOT corpus (Palen-Michel et al., 2022). We also collect text from Wikinews, which is under CC BY-SA.
by Encyclopedic: Finally, we include a large set of Wikipedia articles from the subset included in RedPajama (Together, 2023). We follow RedPajama in using Wikipedia snapshots from 20 languages even though the model primarily focuses on English.
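As a rough illustration of the license boundaries described in this row, the sketch below maps each OLC source to a single license tag and keeps only the sources allowed under a chosen boundary (pd ⊂ pdsw ⊂ pdswby). The source names, tag assignments, and the select_sources helper are illustrative assumptions, not the authors' released tooling; sources that mix license levels (e.g., the Pile of Law) would in practice be split by license.

```python
# Illustrative sketch (not the authors' code): selecting OLC sources under a
# license boundary. Tags: "pd" = public domain, "sw" = permissive software,
# "by" = attribution license. Mixed-license sources are simplified to one tag.
OLC_SOURCES = {
    "pile_of_law": "pd",          # also contains CC-BY portions in the real corpus
    "caselaw_access_project": "pd",
    "redpajama_github": "sw",
    "hacker_news": "sw",
    "ubuntu_irc": "sw",
    "stackexchange": "by",
    "stackoverflow_kaggle": "by",
    "deepmind_mathematics": "sw",
    "amps": "sw",
    "arxiv_abstracts": "pd",
    "s2orc": "by",                # also contains public-domain portions
    "gutenberg": "pd",
    "mot_news": "pd",
    "wikinews": "by",
    "redpajama_wikipedia": "by",
}

# Each boundary is a superset of the previous one: pd < pdsw < pdswby.
BOUNDARIES = {
    "pd": {"pd"},
    "pdsw": {"pd", "sw"},
    "pdswby": {"pd", "sw", "by"},
}

def select_sources(boundary: str) -> list[str]:
    """Return the OLC sources whose license tag is allowed under `boundary`."""
    allowed = BOUNDARIES[boundary]
    return [name for name, tag in OLC_SOURCES.items() if tag in allowed]

# Example: a pdsw model would be trained only on public-domain and
# permissively licensed software sources.
pdsw_sources = select_sources("pdsw")
```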
Dataset Splits | Yes | We first deduplicate within each domain to remove redundant documents from similar sources (e.g., Case Law and the Pile of Law), and then perform deduplication against the validation and test datasets of the Pile to avoid test leakage. We train for multiple epochs on each dataset, tracking validation perplexity every 10B tokens, and perform early stopping. Hyperparameters, including k, λ, and τ, are chosen based on the validation data in a domain-specific manner. Table 11 (Appendix D) reports perplexity of the parametric LMs on the validation data, analogous to Table 2. Table 12 reports perplexity of both parametric and nonparametric LMs on the validation data, analogous to Table 3.
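A minimal sketch of the two deduplication steps quoted in this row, assuming exact-match hashing over whitespace-normalized documents; the excerpt does not specify the matching method, so the doc_hash and deduplicate helpers below are illustrative rather than the authors' pipeline.

```python
# Hedged sketch: within-domain dedup followed by dedup against the Pile
# validation/test documents (to avoid test leakage). Exact-match hashing is
# an assumption; the actual pipeline may use fuzzier matching.
import hashlib
from typing import Iterable

def doc_hash(text: str) -> str:
    """Hash a lowercased, whitespace-normalized document."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(domain_docs: Iterable[str], held_out_docs: Iterable[str]) -> list[str]:
    """Keep one copy of each document within a domain and drop anything that
    also appears in the held-out (validation/test) documents."""
    held_out_hashes = {doc_hash(d) for d in held_out_docs}
    seen, kept = set(), []
    for doc in domain_docs:
        h = doc_hash(doc)
        if h in seen or h in held_out_hashes:
            continue
        seen.add(h)
        kept.append(doc)
    return kept
```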
Hardware Specification | Yes | Each model is trained with 128 A100 GPUs across 16 nodes. Speed is reported in tokens per second with a batch size of 1 using a single NVIDIA RTX 6000 GPU.
Software Dependencies | No | The paper refers to software components such as the LLaMA architecture, OpenLM, the GPT-NeoX-20B tokenizer, FAISS, and Pyserini by name, with citations to their respective papers or repositories, but it does not specify exact version numbers for these dependencies as required for reproducibility (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We use 2,048-token sequences that are packed across document boundaries, and we prepend a beginning-of-text token to every document. We use weight decay of 0.1, the Adam optimizer with β2 = 0.95, 2,000 steps of warmup, and a cosine learning rate scheduler. As a result, we train our pd, pdsw, and pdswby models for 60B, 250B, and 350B tokens in total, respectively. We use the GPT-NeoX-20B tokenizer (Black et al., 2022), with 50,432 BPE types. Hyperparameters, including k, λ, and τ, are chosen based on the validation data in a domain-specific manner. For kNN-LM, each datastore is capped to 1 billion tokens due to resource constraints. More implementation details, statistics, and hyperparameter values for the datastores are reported in Appendix C. For RIC-LM, each datastore consists of text blocks with a length of 1,024 and a sliding window of 512. We use BM25 from Pyserini (Lin et al., 2021). For both parametric LMs and kNN-LM, we apply domain-conditional PMI scoring (Holtzman et al., 2021) to determine the probability of each label. For kNN-LM, we follow a method from Shi et al. (2022) that employs fuzzy verbalizers to expand the token set associated with each output label in our task. We perform a hyperparameter search on the validation dataset of each task, considering k ∈ {128, 512, 4196, 8192}, τ ∈ {1, 3, 5, 10, 40, 80}, and different choices of datastores. The chosen hyperparameters are reported in Table 10.
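To make the roles of k, τ, and λ concrete, the sketch below implements the standard kNN-LM interpolation (Khandelwal et al., 2020) that SILO builds on: retrieve the k nearest datastore keys for the current context vector, turn their negative distances into a token distribution with temperature τ, and mix it with the parametric LM's distribution via λ. The FAISS usage, array names, and default values are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of kNN-LM inference, one of SILO's nonparametric approaches.
# Variable names and defaults are illustrative; the paper's Appendix C gives
# the actual datastore settings.
import numpy as np
import faiss  # the paper reports using FAISS for the kNN-LM datastore

def knn_lm_next_token_probs(
    query_vec: np.ndarray,   # (d,) hidden state of the current context
    lm_probs: np.ndarray,    # (V,) next-token distribution from the parametric LM
    index: faiss.Index,      # FAISS index over datastore key vectors
    next_tokens: np.ndarray, # (N,) token id that followed each stored key
    k: int = 512,            # number of neighbors to retrieve
    tau: float = 10.0,       # softmax temperature over retrieval distances
    lam: float = 0.3,        # interpolation weight on the kNN distribution
) -> np.ndarray:
    # Retrieve the k nearest keys and their distances.
    dists, ids = index.search(query_vec[None, :].astype("float32"), k)
    # Softmax over negative distances, scaled by tau (numerically stabilized).
    logits = -dists[0] / tau
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Aggregate neighbor weights onto the vocabulary via each key's next token.
    knn_probs = np.zeros_like(lm_probs)
    np.add.at(knn_probs, next_tokens[ids[0]], weights)
    # p(y | x) = lam * p_kNN(y | x) + (1 - lam) * p_LM(y | x)
    return lam * knn_probs + (1.0 - lam) * lm_probs
```

In this formulation, k and τ control how sharply the retrieval distribution concentrates on the closest neighbors, while λ trades off the nonparametric datastore against the parametric LM; the excerpt notes that all three are tuned per domain on validation data.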