Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Can Large Language Models Understand Intermediate Representations in Compilers?
Authors: Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present an explorative empirical study evaluating the capabilities of six state-of-the-art LLMs (GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama) in understanding IRs. Specifically, we assess model performance across four core tasks: control flow graph reconstruction, IR decompilation, code summarization, and execution reasoning. |
| Researcher Affiliation | Academia | ¹Kent State University, USA; ²Huazhong University of Science and Technology, China; ³Pacific Northwest National Laboratory, USA; ⁴Chongqing University, China. Correspondence to: Yao Wan <EMAIL>, Bo Fang <EMAIL>, Qiang Guan <EMAIL>. |
| Pseudocode | No | The paper describes methods and tasks but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the experimental data and source code are publicly available at https://github.com/hjiang13/LLM4IR. |
| Open Datasets | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures. |
| Dataset Splits | No | The paper uses the HumanEval benchmark, which comprises 164 programming tasks. It does not explicitly mention training/test/validation dataset splits, as the focus is on evaluating LLMs on these benchmark tasks rather than training a new model with specific splits. |
| Hardware Specification | Yes | The compilation experiments were conducted on a Dell Workstation equipped with 32 Intel(R) Xeon(R) CPUs E5-2620 v4 @ 2.10GHz, running on an x86-64 architecture with a 64-bit system. |
| Software Dependencies | Yes | For these experiments, we used Clang adapted for LLVM 13 on Ubuntu 18.04. |
| Experiment Setup | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures. To enhance response precision and consistency, we adopt an Expert Meta-Template Prompt format. For each of the four tasks (i.e., CFG reconstruction, IR decompilation, code summarization, and execution reasoning), we iteratively refine prompts using strategies such as few-shot learning and CoT prompting (Wei et al., 2022; Xie et al., 2025). |
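
The dataset construction described in the table (each C++ program compiled with Clang at four optimization levels to produce LLVM IR) can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the helper name `ir_commands`, the `ir` output directory, the file naming scheme, and the `clang++` driver name are all assumptions, and the exact invocation used in the paper is not given here. The flags `-S -emit-llvm` are the standard Clang options for emitting textual LLVM IR.

```python
from pathlib import Path

# The four optimization levels named in the table above.
OPT_LEVELS = ("-O0", "-O1", "-O2", "-O3")


def ir_commands(source, out_dir="ir", compiler="clang++"):
    """Build one Clang command per optimization level for a C++ source file.

    Hypothetical helper: the output directory and naming scheme are
    illustrative assumptions, not taken from the paper.
    """
    src = Path(source)
    out = Path(out_dir)
    commands = []
    for level in OPT_LEVELS:
        # e.g. task.cpp at -O2 -> ir/task_O2.ll
        target = out / f"{src.stem}{level.replace('-', '_')}.ll"
        commands.append(
            [compiler, "-S", "-emit-llvm", level, str(src), "-o", str(target)]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # without Clang installed. "example.cpp" is a placeholder file name.
    for cmd in ir_commands("example.cpp"):
        print(" ".join(cmd))
```

To actually generate the IR files, each command list could be passed to `subprocess.run(cmd, check=True)` after creating the output directory; that step assumes a working Clang installation (the table reports Clang adapted for LLVM 13 on Ubuntu 18.04).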