Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ComPile: A Large IR Dataset from Production Sources

Authors: Aiden Grossman, Ludger Paehler, Konstantinos Parasyris, Tal Ben-Nun, Jacob Hegna, William S. Moses, Jose M Monsalve Diaz, Mircea Trofin, Johannes Doerfert

DMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Statistical analysis proves the utility of our dataset not only for large language model training, but also for the introspection into the code generation process itself as well as for training of machine-learned compiler components. [...] To evaluate the quality of our dataset and compare against other compiler-focused datasets, we train a series of small LLMs to predict code size, both before (O0) and after optimization (O3). [...] Results for O0 are shown in Table 5. We see significantly improved performance for the models trained on ComPile, with the mean absolute percentage error (MAPE) being 17.6% for the best model trained on ComPile versus 63.7% for the best model trained on AnghaBench.
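The quoted evaluation reports mean absolute percentage error (MAPE). A minimal sketch of that metric, for readers unfamiliar with it (the paper's exact implementation is not given, and the values below are hypothetical):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    assert len(y_true) == len(y_pred) and y_true
    return 100.0 * sum(
        abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)
    ) / len(y_true)

# Hypothetical actual vs. predicted code sizes (bytes).
print(round(mape([100, 200], [110, 180]), 1))  # 10.0
```

A MAPE of 17.6% (the best ComPile-trained model) means predictions are off by 17.6% of the true code size on average.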
Researcher Affiliation Collaboration (1) University of California, Davis, USA; (2) School of Computation, Information and Technology, Technical University of Munich, Germany; (3) Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, USA; (4) University of Minnesota, Twin Cities, USA; (5) Department of Computer Science, University of Illinois Urbana-Champaign, USA; (6) Division of Mathematics and Computer Science, Argonne National Laboratory, USA; (7) Google Inc., USA
Pseudocode No The paper describes workflows and processes, such as in Figure 2 'Individual components of the dataset collection tooling', but does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Open-sourcing of workflow and compiler tooling to construct massive-scale code datasets, which are easy to install and ready for scalable deployment in HPC and cloud environments. The statistics of the entire dataset constructable with the tooling are available in Appendix A. [...] 1. pypi.org/project/llvm-ir-dataset-utils/ 2. github.com/llvm-ml/llvm-ir-dataset-utils
Open Datasets Yes To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a 1.4T Llama 2 token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++. [...] Permissibly licensed subset of the dataset available under huggingface.co/datasets/llvm-ml/ComPile
Dataset Splits No To create the code size fine-tuning dataset, we take only the C and C++ portions of ComPile, extract individual functions from the modules present, and then compile them to obtain the corresponding code size. We also collect the textual IR, tokenize it with a GPT tokenizer with a vocabulary of 32768, and discard any examples with a number of tokens that exceed the model's context length, in our case, 2048. We were able to collect approximately 2.7M functions from the C/C++ split of ComPile while rejecting 1.8M functions that did not fit into the context window. [...] A subset of 1000 batches is used as a test set during training. [...] To evaluate the models, we create an evaluation dataset using the LLVM test suite.
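The context-length filter described in this quote can be sketched as follows. This is an illustrative assumption, not the paper's tooling; the whitespace split stands in for the actual GPT-style BPE tokenizer with its 32768-entry vocabulary:

```python
CONTEXT_LENGTH = 2048

def tokenize(ir_text):
    # Placeholder tokenizer: whitespace split. The real pipeline
    # uses a trained BPE tokenizer over textual LLVM IR.
    return ir_text.split()

def filter_functions(ir_functions, max_tokens=CONTEXT_LENGTH):
    """Keep functions that fit the model's context window."""
    kept, rejected = [], 0
    for ir in ir_functions:
        if len(tokenize(ir)) <= max_tokens:
            kept.append(ir)
        else:
            rejected += 1
    return kept, rejected

funcs = ["define i32 @f() { ret i32 0 }", "tok " * 3000]
kept, rejected = filter_functions(funcs)
print(len(kept), rejected)  # 1 1
```

Applied at scale, this is the step that kept ~2.7M functions and rejected ~1.8M that exceeded the 2048-token window.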
Hardware Specification No The paper does not specify the hardware used for its experiments; it refers only generally to deployment in HPC and cloud environments.
Software Dependencies Yes To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a 1.4T Llama 2 token dataset of LLVM IR. [...] For these experiments we used the Llama 2 tokenizer (Touvron et al., 2023b) to be able to compare ComPile's size to contemporary datasets. [...] We chose to use BPE tokenization (Sennrich et al., 2016) as it is one of the most commonly used techniques for tokenization for LLMs and easily adaptable to the textual component of our dataset. We gathered approximately 400 bitcode modules from each language and disassembled them into IR, training a BPE tokenizer over this data using fastBPE, generating several different vocabulary sizes for the various experiments. [...] To train models, we utilize the MPT architecture (Research, 2024) at sizes of 125M, 150M, 200M, and 250M parameters.
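The BPE tokenization referenced in this quote (Sennrich et al., 2016) works by repeatedly merging the most frequent adjacent symbol pair in the corpus. A toy sketch of one merge step, using made-up IR-ish words rather than the paper's actual fastBPE training run:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of (symbols -> freq)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy character-split corpus with word frequencies.
corpus = {tuple("load"): 5, tuple("loop"): 3, tuple("store"): 2}
pair = most_frequent_pair(corpus)
print(pair)  # ('l', 'o'): appears 5 + 3 = 8 times
corpus = merge_pair(corpus, pair)
```

Repeating this until the vocabulary reaches a target size (e.g., 32768 in the fine-tuning setup) yields the learned merge table.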
Experiment Setup Yes To create the code size fine-tuning dataset, we take only the C and C++ portions of ComPile, extract individual functions from the modules present, and then compile them to obtain the corresponding code size. We also collect the textual IR, tokenize it with a GPT tokenizer with a vocabulary of 32768, and discard any examples with a number of tokens that exceed the model's context length, in our case, 2048. We were able to collect approximately 2.7M functions from the C/C++ split of ComPile while rejecting 1.8M functions that did not fit into the context window. [...] To train models, we utilize the MPT architecture (Research, 2024) at sizes of 125M, 150M, 200M, and 250M parameters. Each model is pretrained for a specified number of batches depending upon the model size, coming out to approximately 20 tokens per parameter. The models are subsequently finetuned for the same number of batches on code size prediction.
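The "approximately 20 tokens per parameter" budget quoted above implies the following back-of-the-envelope pretraining sizes. The batch size is an assumed placeholder (the excerpt does not state one); only the 2048 context length and 20 tokens/parameter come from the paper:

```python
CONTEXT_LENGTH = 2048       # from the paper
TOKENS_PER_PARAM = 20       # from the paper
ASSUMED_BATCH_SIZE = 256    # hypothetical sequences per batch

def pretraining_budget(num_params):
    """Total pretraining tokens and the implied batch count."""
    total_tokens = TOKENS_PER_PARAM * num_params
    tokens_per_batch = ASSUMED_BATCH_SIZE * CONTEXT_LENGTH
    return total_tokens, total_tokens // tokens_per_batch

for params in (125_000_000, 150_000_000, 200_000_000, 250_000_000):
    tokens, batches = pretraining_budget(params)
    print(f"{params // 10**6}M params -> {tokens / 10**9:.1f}B tokens, "
          f"~{batches} batches (assumed batch size)")
```

So the 125M-parameter model sees roughly 2.5B tokens, which explains why batch counts were scaled with model size.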