Verified Code Transpilation with LLMs

Authors: Sahil Bhatia, Jie Qiu, Niranjan Hasabnis, Sanjit Seshia, Alvin Cheung

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our approach not only outperforms previous symbolic-based tools in both the number of benchmarks transpiled and transpilation time, but also requires significantly less effort to build." "To evaluate the effectiveness of LLMLIFT, we evaluate across four distinct DSLs, each targeting a different application domain."
Researcher Affiliation | Collaboration | Sahil Bhatia (UC Berkeley), Jie Qiu, Niranjan Hasabnis (Code Metal), Sanjit A. Seshia (UC Berkeley), Alvin Cheung (UC Berkeley); {sahilbhatia, jieq, sseshia, akcheung}@berkeley.edu, niranjan@codemetal.ai
Pseudocode | Yes | Figure 5: LLMLIFT's algorithm for generating PS and Invs.
Open Source Code | No | All the benchmarks used for evaluation can be found at: https://github.com/metalift/metalift/tree/llmlift/llmlift/benchmarks. This link provides the evaluation benchmarks only, not the method's source code, and the paper contains no other unambiguous statement that the method's code is released.
Open Datasets | Yes | All the benchmarks used for evaluation can be found at: https://github.com/metalift/metalift/tree/llmlift/llmlift/benchmarks
Dataset Splits | No | The paper describes the benchmarks used for evaluation but does not specify any training/validation/test splits with percentages or counts.
Hardware Specification | No | "In all experiments, we use GPT-4 via their APIs to generate candidates. We queried these models using their API endpoints." The paper relies on external LLM APIs and does not specify the local hardware used for experiments.
Software Dependencies | No | "The parser and logic for invoking the LLMs are implemented in Python with 700 LoC." The paper mentions Python but does not provide version numbers for Python or any other software libraries used.
Experiment Setup | Yes | "We set the temperature to 0.7 for all the experiments. For program summary and invariant generation across all domains, we use the same zero-shot PS prompt in Fig. 6 and one-shot prompt in Fig. 7, respectively. We keep a budget of 50 queries for the PS and a budget of 10 queries for the invariants of each PS."
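
The Experiment Setup row gives enough detail (temperature 0.7, a zero-shot PS prompt, a one-shot invariant prompt, and budgets of 50 PS queries and 10 invariant queries per PS) to sketch the reported query loop. The snippet below is a minimal, hedged reconstruction under assumptions: it uses the OpenAI Python client, and the prompt templates and the verify() hook are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of the query loop described in the Experiment Setup row.
# Assumptions: the OpenAI Python client (openai>=1.0); prompt templates and
# the verify() hook are hypothetical placeholders, not the paper's code.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

TEMPERATURE = 0.7   # reported sampling temperature for all experiments
PS_BUDGET = 50      # up to 50 program-summary (PS) candidates
INV_BUDGET = 10     # up to 10 invariant candidates per PS


def ask_gpt4(prompt: str) -> str:
    """One GPT-4 completion at the reported temperature."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def verify(source: str, ps: str, inv: str) -> bool:
    """Placeholder for the verification step (e.g., an SMT-based check);
    the paper's verifier is not reconstructed here."""
    raise NotImplementedError


def transpile(source: str, ps_prompt: str, inv_prompt: str):
    """Sample PS candidates (zero-shot prompt), then invariants for each PS
    (one-shot prompt), returning the first pair that verifies."""
    for _ in range(PS_BUDGET):
        ps = ask_gpt4(ps_prompt.format(code=source))
        for _ in range(INV_BUDGET):
            inv = ask_gpt4(inv_prompt.format(code=source, ps=ps))
            if verify(source, ps, inv):
                return ps, inv
    return None
```

The nested loop mirrors the stated budgets: one outer pass per PS candidate and up to ten invariant attempts before moving on, so a run issues at most 50 + 50 * 10 LLM queries per benchmark under these assumptions.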