ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers
Authors: Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, Lei Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that when equipped with ALGO, we achieve an 8× better one-submission pass rate over the Codex model and a 2.6× better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. |
| Researcher Affiliation | Academia | Kexun Zhang1 Danqing Wang1 Jingtao Xia1 William Yang Wang1 Lei Li2 1University of California Santa Barbara 2Carnegie Mellon University {kexun,danqingwang,jingtaoxia,william}@ucsb.edu leili@cs.cmu.edu |
| Pseudocode | Yes | Figure 11: The prompt we used for oracle generation and one oracle generated with it. The instructions are in blue. The language model is instructed to generate the most straightforward solution by enumerating over a very large search space of all combinations of relevant variables. The generated oracle enumerates all the possible ordered partitions of work allocations to find out the optimal one. [...] class BruteforceSolution: def repairCars(self, ranks: List[int], cars: int) -> int: [...] |
| Open Source Code | Yes | The problem set we used for testing, the prompts we used, the verifier and solution programs, and the test cases generated by ALGO are available at https://github.com/zkx06111/ALGO. |
| Open Datasets | Yes | We evaluate ALGO with Codex, CodeT and PG-TM on all 165 problems from the CodeContests test set, which are collected from Codeforces, a competitive programming website. All three baselines we evaluate have been evaluated on CodeContests in their own papers; we follow the exact same setting to ensure a fair comparison. [17] |
| Dataset Splits | No | The paper discusses validation of inputs and generated code outputs but does not specify a train/validation/test split for the datasets used in its experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to conduct the experiments. |
| Software Dependencies | Yes | ChatGPT Code Interpreter is used as a coder that utilizes instruction enumerators. It is a variant of GPT-3.5 with the code interpreter plugin. The version we evaluated, released on March 24th, utilizes the underlying model text-davinci-002-code. |
| Experiment Setup | Yes | We use a temperature of 1.0 and resample the solution until it can pass all example cases. [...] We use the default temperature of 1.0 and resample the code solution until it can pass the example cases in the problem descriptions or reach the resample limit (N=5). |
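The Pseudocode row quotes an oracle that enumerates all ordered partitions of work allocations. As a rough illustration of that enumeration style, here is a minimal sketch of a brute-force `repairCars` oracle (the class and method names follow the quoted snippet; the car-repair cost model `rank * n**2` and the allocation strategy are assumptions for illustration, not the paper's verbatim oracle):

```python
from itertools import product
from typing import List


class BruteforceSolution:
    """Sketch of an enumeration-style oracle: try every ordered way of
    splitting the cars among the mechanics, keep the allocation whose
    slowest mechanic finishes earliest."""

    def repairCars(self, ranks: List[int], cars: int) -> int:
        best = float("inf")
        # Enumerate every ordered allocation (n_1, ..., n_k) with sum == cars.
        for alloc in product(range(cars + 1), repeat=len(ranks)):
            if sum(alloc) != cars:
                continue
            # Assumed cost model: a mechanic with rank r repairs n cars in
            # r * n**2 minutes; mechanics work in parallel, so the total
            # time is the maximum over mechanics.
            finish = max(r * n * n for r, n in zip(ranks, alloc))
            best = min(best, finish)
        return best
```

This deliberately ignores efficiency: the point of such an oracle is to be obviously correct on small inputs so it can verify candidate solutions, not to scale.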
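The Experiment Setup row describes sampling at temperature 1.0 and resampling until the candidate passes the example cases or the limit N=5 is reached. A minimal sketch of that loop, with hypothetical `generate` and `run_examples` callables standing in for the model call and the example-case harness:

```python
def sample_until_pass(generate, run_examples, max_resamples=5):
    """Sketch of the resample loop: draw a candidate solution at
    temperature 1.0 and retry until it passes the example cases or the
    resample limit (N=5 in the paper's setup) is exhausted."""
    for attempt in range(max_resamples):
        # `generate` is a hypothetical stand-in for the LLM sampling call.
        code = generate(temperature=1.0)
        # `run_examples` is a hypothetical stand-in for executing the
        # candidate against the example cases in the problem description.
        if run_examples(code):
            return code, attempt + 1
    return None, max_resamples
```

Returning `None` after the limit lets the caller count the problem as unsolved for the one-submission metric rather than submitting a failing candidate.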