What's In My Big Data?
Authors: Yanai Elazar, Akshita Bhagia, Ian Helgi Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Evan Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hannaneh Hajishirzi, Noah A. Smith, Jesse Dodge
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose WHAT'S IN MY BIG DATA? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities, count and search, at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. |
| Researcher Affiliation | Collaboration | Yanai Elazar (1,2), Akshita Bhagia (1), Ian Magnusson (1), Abhilasha Ravichander (1), Dustin Schwenk (1), Alane Suhr (3), Pete Walsh (1), Dirk Groeneveld (1), Luca Soldaini (1), Sameer Singh (4), Hannaneh Hajishirzi (1,2), Noah A. Smith (1,2), Jesse Dodge (1). (1) Allen Institute for AI; (2) Paul G. Allen School of Computer Science & Engineering, University of Washington; (3) University of California, Berkeley; (4) University of California, Irvine |
| Pseudocode | No | The paper describes algorithms in Appendix E, such as "To collect the (approximate) top-k n-grams..." and "To estimate the number of unique n-grams...", but it does not present these algorithms in a formatted pseudocode or algorithm block. |
| Open Source Code | Yes | We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them. https://github.com/allenai/wimbd |
| Open Datasets | Yes | We cover ten different large corpora, spanning across text-only (e.g., C4) to image captions (LAION-2B-en) and code (The Stack). These corpora have been used in training language models (or similar large-scale models, such as Stable Diffusion; Rombach et al. 2022). A high-level description of these datasets using WIMBD is presented in Table 2, and further details about the construction and origin of these corpora are detailed in Appendix A. Appendix A provides citations and URLs for accessing these datasets (e.g., OpenWebText (Gokaslan & Cohen, 2019) URL: https://skylion007.github.io/OpenWebTextCorpus/, LAION (Schuhmann et al., 2022) URL: https://openreview.net/forum?id=M3Y74vmsMcY). |
| Dataset Splits | Yes | We measure contamination by testing whether all input fields are present in a single document and report the percentage of contaminated examples from the test set. Our contamination evaluation serves as an upper bound of exact-match dataset contamination. We provide more details of our analysis and design choices in Appendix B.3.1. We filter out datasets we cannot automatically download from Huggingface datasets (Lhoest et al., 2021), as well as datasets that do not have a test split. |
| Hardware Specification | Yes | We run our experiments using a compute node machine with 224 CPUs and 882GB RAM, and an Elasticsearch cluster for the indexed corpora. |
| Software Dependencies | No | The paper mentions using Elasticsearch for its search capabilities and states experiments were run on a Google Cloud compute node. However, it does not provide specific version numbers for Elasticsearch, Python, or any other software libraries used in the analysis. |
| Experiment Setup | Yes | Using these tools, we perform a set of sixteen analyses on ten different English corpora used to train LMs. We divide our analyses into four categories: (1) data statistics (e.g., number of tokens and domain distribution; §4.2); (2) data quality (e.g., most frequent n-grams and measuring duplicate documents; §4.3); (3) community- and society-relevant measurements (e.g., benchmark contamination and personally identifiable information detection; §4.4); and (4) cross-corpora analysis (e.g., comparing the most common n-gram and document overlap; §B.4). |
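The paper's Appendix E describes an approximate top-k n-gram procedure, but the report above only quotes its opening sentences. The sketch below is a minimal, illustrative version of the general idea (count n-grams in a bounded counter and prune rare entries when it overflows, so frequent n-grams survive); the function name, the pruning heuristic, and all parameters are our own assumptions, not WIMBD's actual implementation.

```python
from collections import Counter

def approx_top_k_ngrams(docs, n=2, k=10, max_counter_size=100_000):
    """Approximate top-k n-gram counting with a bounded counter.

    Whenever the counter grows past max_counter_size, prune it down
    to its most frequent half. Rare n-grams may be undercounted, but
    heavy hitters survive pruning, so the final top-k is a reasonable
    approximation for heavy-tailed corpora.
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
        if len(counts) > max_counter_size:
            # Keep only the most frequent half of the entries.
            counts = Counter(dict(counts.most_common(max_counter_size // 2)))
    return counts.most_common(k)

docs = ["the cat sat on the mat", "the cat ate the fish"]
top = approx_top_k_ngrams(docs, n=2, k=3)
# The bigram ('the', 'cat') appears twice and ranks first.
print(top)
```

In a realistic setting the pruning step is what keeps memory bounded over terabytes of text; the trade-off is that an n-gram whose occurrences are spread thinly across many shards can be dropped before it accumulates a large count, which is why the paper describes its result as approximate.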