Verified Code Transpilation with LLMs

Authors: Sahil Bhatia, Jie Qiu, Niranjan Hasabnis, Sanjit Seshia, Alvin Cheung

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our approach not only outperforms previous symbolic-based tools in both the number of benchmarks transpiled and transpilation time, but also requires significantly less effort to build." "To evaluate the effectiveness of LLMLIFT, we evaluate across four distinct DSLs, each targeting a different application domain."
Researcher Affiliation | Collaboration | Sahil Bhatia (UC Berkeley), Jie Qiu, Niranjan Hasabnis (Code Metal), Sanjit A. Seshia (UC Berkeley), Alvin Cheung (UC Berkeley); {sahilbhatia, jieq, sseshia, akcheung}@berkeley.edu, niranjan@codemetal.ai
Pseudocode | Yes | Figure 5: LLMLIFT's algorithm for generating PS and Invs.
Open Source Code | No | All the benchmarks used for evaluation can be found at: https://github.com/metalift/metalift/tree/llmlift/llmlift/benchmarks. This link provides the evaluation benchmarks only, not the method's source code, and the paper contains no other unambiguous statement that the method's code is released.
Open Datasets | Yes | All the benchmarks used for evaluation can be found at: https://github.com/metalift/metalift/tree/llmlift/llmlift/benchmarks
Dataset Splits | No | The paper describes the benchmarks used for evaluation but does not specify any training/validation/test splits with percentages or counts.
Hardware Specification | No | "In all experiments, we use GPT-4 via their APIs to generate candidates. We queried these models using their API endpoints." The paper relies on external LLM APIs and does not specify the local hardware used for experiments.
Software Dependencies | No | "The parser and logic for invoking the LLMs are implemented in Python with 700 LoC." The paper mentions Python but does not provide version numbers for Python or any other software libraries used.
Experiment Setup | Yes | "We set the temperature to 0.7 for all the experiments. For program summary and invariant generation across all domains, we use the same zero-shot PS prompt in Fig. 6 and one-shot prompt in Fig. 7, respectively. We keep a budget of 50 queries for the PS and a budget of 10 queries for the invariants of each PS."
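
The Experiment Setup row gives enough detail (temperature 0.7, a zero-shot PS prompt, a one-shot invariant prompt, and budgets of 50 PS queries and 10 invariant queries per PS) to sketch the reported query loop. The snippet below is a minimal, hedged reconstruction under assumptions: it uses the OpenAI Python client, and the prompt templates and the verify() hook are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of the query loop described in the Experiment Setup row.
# Assumptions: the OpenAI Python client (openai>=1.0); prompt templates and
# the verify() hook are hypothetical placeholders, not the paper's code.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

TEMPERATURE = 0.7   # reported sampling temperature for all experiments
PS_BUDGET = 50      # up to 50 program-summary (PS) candidates
INV_BUDGET = 10     # up to 10 invariant candidates per PS


def ask_gpt4(prompt: str) -> str:
    """One GPT-4 completion at the reported temperature."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def verify(source: str, ps: str, inv: str) -> bool:
    """Placeholder for the verification step (e.g., an SMT-based check);
    the paper's verifier is not reconstructed here."""
    raise NotImplementedError


def transpile(source: str, ps_prompt: str, inv_prompt: str):
    """Sample PS candidates (zero-shot prompt), then invariants for each PS
    (one-shot prompt), returning the first pair that verifies."""
    for _ in range(PS_BUDGET):
        ps = ask_gpt4(ps_prompt.format(code=source))
        for _ in range(INV_BUDGET):
            inv = ask_gpt4(inv_prompt.format(code=source, ps=ps))
            if verify(source, ps, inv):
                return ps, inv
    return None
```

The nested loop mirrors the stated budgets: one outer pass per PS candidate and up to ten invariant attempts before moving on, so a run issues at most 50 + 50 * 10 LLM queries per benchmark under these assumptions.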