Verified Code Transpilation with LLMs
Authors: Sahil Bhatia, Jie Qiu, Niranjan Hasabnis, Sanjit Seshia, Alvin Cheung
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach not only outperforms previous symbolic-based tools in both the number of benchmarks transpiled and transpilation time, but also requires significantly less effort to build. (Section 4, Experiments:) To evaluate the effectiveness of LLMLIFT, we evaluate across four distinct DSLs, each targeting a different application domain. |
| Researcher Affiliation | Collaboration | Sahil Bhatia¹, Jie Qiu, Niranjan Hasabnis², Sanjit A. Seshia¹, Alvin Cheung¹ (¹UC Berkeley, ²Code Metal); {sahilbhatia, jieq, sseshia, akcheung}@berkeley.edu, niranjan@codemetal.ai |
| Pseudocode | Yes | Figure 5: LLMLIFT’s algorithm for generating PS and Invs. |
| Open Source Code | No | All the benchmarks used for evaluation can be found at: https://github.com/metalift/metalift/tree/llmlift/llmlift/benchmarks - This link provides only the benchmarks, not the method's source code, and the paper makes no other unambiguous statement that the method's code is released. |
| Open Datasets | Yes | All the benchmarks used for evaluation can be found at: https://github.com/metalift/metalift/tree/llmlift/llmlift/benchmarks |
| Dataset Splits | No | The paper describes benchmarks used for evaluation but does not specify any training/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | In all experiments, we use GPT-4 via their APIs to generate candidates. We queried these models using their API endpoints. - The paper relies on external LLM APIs and does not specify any local hardware used to run the experiments. |
| Software Dependencies | No | The parser and logic for invoking the LLMs are implemented in Python with 700 LoC. - The paper mentions Python but does not provide version numbers for Python or for any other software libraries used. |
| Experiment Setup | Yes | We set the temperature to 0.7 for all the experiments. For program summary (PS) and invariant generation across all domains, we use the same zero-shot PS prompt in Fig. 6 and one-shot invariant prompt in Fig. 7, respectively. We keep a budget of 50 queries for the PS and a budget of 10 invariant queries for each PS. |
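
For concreteness, below is a minimal sketch of how the budgeted query loop from the "Experiment Setup" row might be wired up. It assumes the OpenAI Python SDK; the prompt strings are placeholders standing in for the paper's Fig. 6 (zero-shot PS) and Fig. 7 (one-shot invariant) prompts, and `verify` is a hypothetical stand-in for LLMLIFT's verification backend. This is an illustrative reconstruction, not the authors' code.

```python
# A minimal sketch of the query loop implied by the "Experiment Setup" row:
# GPT-4 at temperature 0.7, a budget of 50 program-summary (PS) candidates,
# and 10 invariant attempts per PS. The prompt strings and `verify` are
# hypothetical placeholders, not LLMLIFT's actual prompts or verifier.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PS_BUDGET = 50   # max PS candidates per benchmark (from the paper)
INV_BUDGET = 10  # max invariant attempts per PS candidate (from the paper)


def query_gpt4(prompt: str) -> str:
    """Single API call at the temperature reported in the paper (0.7)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0.7,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def verify(source: str, ps: str, inv: str) -> bool:
    """Hypothetical stand-in for the paper's verification step, which
    checks the PS and invariants against the source program."""
    raise NotImplementedError("plug in a symbolic verifier here")


def transpile(source_code: str) -> str | None:
    """Sample PS candidates, then invariants for each, until one verifies."""
    for _ in range(PS_BUDGET):
        # Placeholder for the zero-shot PS prompt (Fig. 6 in the paper).
        ps = query_gpt4(f"Rewrite this program in the target DSL:\n{source_code}")
        for _ in range(INV_BUDGET):
            # Placeholder for the one-shot invariant prompt (Fig. 7).
            inv = query_gpt4(f"Propose loop invariants for:\n{ps}")
            if verify(source_code, ps, inv):
                return ps  # verified translation found
    return None  # budgets exhausted without a verified candidate
```

Nesting the invariant loop inside the PS loop mirrors the reported budgets: up to 10 invariant queries are spent on each PS candidate before a fresh PS is sampled, for at most 50 PS candidates per benchmark.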