ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers
Authors: Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, Lei Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that when equipped with ALGO, we achieve an 8× better one-submission pass rate over the Codex model and a 2.6× better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. |
| Researcher Affiliation | Academia | Kexun Zhang1 Danqing Wang1 Jingtao Xia1 William Yang Wang1 Lei Li2 1University of California Santa Barbara 2Carnegie Mellon University {kexun,danqingwang,jingtaoxia,william}@ucsb.edu leili@cs.cmu.edu |
| Pseudocode | Yes | Figure 11: The prompt we used for oracle generation and one oracle generated with it. The instructions are in blue. The language model is instructed to generate the most straightforward solution by enumerating over a very large search space of all combinations of relevant variables. The generated oracle enumerates all the possible ordered partitions of work allocations to find out the optimal one. [...] class BruteforceSolution: def repairCars(self, ranks: List[int], cars: int) -> int: [...] |
| Open Source Code | Yes | The problem set we used for testing, the prompts we used, the verifier and solution programs, and the test cases generated by ALGO are available at https://github.com/zkx06111/ALGO. |
| Open Datasets | Yes | We evaluate ALGO with Codex, CodeT and PG-TM on all 165 problems from the CodeContests test set, which are collected from Codeforces, a competitive programming website. All three baselines we evaluate have been evaluated on CodeContests in their own papers; we follow the exact same setting to ensure a fair comparison. [17] |
| Dataset Splits | No | The paper discusses validation of inputs and generated code outputs but does not specify a train/validation/test split for the datasets used in its experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to conduct the experiments. |
| Software Dependencies | Yes | ChatGPT Code Interpreter is used as a coder that utilizes instruction enumerators. It is a variant of GPT-3.5 with the code interpreter plugin. The version we evaluated, released on March 24th, utilizes the underlying model text-davinci-002-code. |
| Experiment Setup | Yes | We use a temperature of 1.0 and resample the solution until it can pass all example cases. [...] We use the default temperature of 1.0 and resample the code solution until it can pass the example cases in the problem descriptions or reach the resample limit (N=5). |
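The Pseudocode row quotes an oracle that enumerates all ordered partitions of work allocations. As a rough illustration of that enumeration style, here is a minimal sketch of a brute-force `repairCars` oracle (the class and method names follow the quoted snippet; the car-repair cost model `rank * n**2` and the allocation strategy are assumptions for illustration, not the paper's verbatim oracle):

```python
from itertools import product
from typing import List


class BruteforceSolution:
    """Sketch of an enumeration-style oracle: try every ordered way of
    splitting the cars among the mechanics, keep the allocation whose
    slowest mechanic finishes earliest."""

    def repairCars(self, ranks: List[int], cars: int) -> int:
        best = float("inf")
        # Enumerate every ordered allocation (n_1, ..., n_k) with sum == cars.
        for alloc in product(range(cars + 1), repeat=len(ranks)):
            if sum(alloc) != cars:
                continue
            # Assumed cost model: a mechanic with rank r repairs n cars in
            # r * n**2 minutes; mechanics work in parallel, so the total
            # time is the maximum over mechanics.
            finish = max(r * n * n for r, n in zip(ranks, alloc))
            best = min(best, finish)
        return best
```

This deliberately ignores efficiency: the point of such an oracle is to be obviously correct on small inputs so it can verify candidate solutions, not to scale.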
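The Experiment Setup row describes sampling at temperature 1.0 and resampling until the candidate passes the example cases or the limit N=5 is reached. A minimal sketch of that loop, with hypothetical `generate` and `run_examples` callables standing in for the model call and the example-case harness:

```python
def sample_until_pass(generate, run_examples, max_resamples=5):
    """Sketch of the resample loop: draw a candidate solution at
    temperature 1.0 and retry until it passes the example cases or the
    resample limit (N=5 in the paper's setup) is exhausted."""
    for attempt in range(max_resamples):
        # `generate` is a hypothetical stand-in for the LLM sampling call.
        code = generate(temperature=1.0)
        # `run_examples` is a hypothetical stand-in for executing the
        # candidate against the example cases in the problem description.
        if run_examples(code):
            return code, attempt + 1
    return None, max_resamples
```

Returning `None` after the limit lets the caller count the problem as unsolved for the one-submission metric rather than submitting a failing candidate.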