Parsel🐍: Algorithmic Reasoning with Language Models by Composing Decompositions
Authors: Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, Nick Haber
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that, using Parsel, LLMs solve more competition-level problems in the APPS dataset, resulting in pass rates over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. Moreover, with automatically generated tests, we find that Parsel can improve the state-of-the-art pass@1 performance on HumanEval from 67% to 85%. |
| Researcher Affiliation | Academia | Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, Nick Haber Stanford University {ezelikman, qhwang, poesia, ngoodman, nhaber}@stanford.edu |
| Pseudocode | Yes | We visualize the details in Fig. 1 and provide a high-level pseudocode in Fig. A.11. |
| Open Source Code | Yes | We release our code at https://github.com/ezelikman/parsel. |
| Open Datasets | Yes | We evaluated Parsel on the competition-level subset of the APPS [27] dataset as follows: [...] We next tested Parsel on HumanEval [12]. |
| Dataset Splits | No | The paper evaluates its method on subsets of the APPS and HumanEval datasets and discusses testing and pass rates, but it does not specify any explicit training/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | The most computationally intensive part of this research, by far (in terms of FLOPS), was the ablation using an open-source CodeGen model, which required several hundred A100 hours. |
| Software Dependencies | No | The paper mentions using Python, PyTorch, and TensorFlow libraries but does not provide specific version numbers for any of its software dependencies. |
| Experiment Setup | Yes | We sample everything with temperature=0.6, except the translations which we sample with temperature=0.2, a presence penalty of 0.1, and a logit bias to prevent it from generating the text "def ", as Codex has a tendency to degenerate to producing Python even when prompted with Parsel examples. We allow at most 500 tokens per function, but in practice found that they typically used less than half of them. For evaluation, we use a timeout of 0.04 seconds per solution and evaluate at most 100,000 implementations per generated Parsel program. |
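
The pass-rate and pass@1 figures quoted in the Research Type row above presumably follow the standard pass@k metric from the Codex evaluation methodology (Chen et al., 2021); the excerpt does not reproduce the estimator itself, so the snippet below is a minimal sketch of that standard unbiased estimator rather than code from the Parsel repository. Here `n` is the number of sampled programs, `c` the number that pass all tests, and `k` the sample budget being scored.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one of k
    programs drawn without replacement from n samples (c of them correct)
    passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: if 85 of 100 sampled programs pass, the pass@1 estimate is 0.85.
print(pass_at_k(n=100, c=85, k=1))
```

For k=1 this reduces to the fraction of sampled programs that pass, which matches how a HumanEval pass@1 of 85% is usually read.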
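
The sampling settings quoted in the Experiment Setup row (temperature 0.2 for translations, a presence penalty of 0.1, a logit bias suppressing the text "def ", and at most 500 tokens per function) were applied to Codex through the OpenAI API. The sketch below shows one way such a call could look with the legacy pre-1.0 `openai` Python package, assuming the `p50k_base` vocabulary used by Codex-era models; the model name, the prompt, and the choice to bias every sub-token of "def" are illustrative assumptions, not details taken from the paper.

```python
import openai    # legacy (<1.0) OpenAI Python package
import tiktoken

# Assumption: approximate the paper's "logit bias to prevent ... def" by
# assigning the minimum bias of -100 to every sub-token of "def" under the
# p50k_base vocabulary used by Codex-era models.
enc = tiktoken.get_encoding("p50k_base")
banned = {str(tok): -100 for tok in enc.encode("def")}

response = openai.Completion.create(
    model="code-davinci-002",      # placeholder Codex-style model name
    prompt="# Translate the following Parsel program to Python:\n...",
    temperature=0.2,               # translation temperature from the paper
    presence_penalty=0.1,          # presence penalty from the paper
    logit_bias=banned,             # discourage degenerating into raw Python
    max_tokens=500,                # at most 500 tokens per function
)
print(response["choices"][0]["text"])
```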
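
The same row quotes a per-solution timeout of 0.04 seconds and a cap of 100,000 evaluated implementations per generated Parsel program. The sketch below illustrates, under those two numbers, how a compositional search over per-function candidate implementations might be budgeted and timed out. The function names and the use of `multiprocessing` are assumptions for illustration, not the authors' harness, which would need far lower per-candidate overhead to make a 0.04 s timeout practical.

```python
import itertools
import multiprocessing as mp

def _run_candidate(source, tests, queue):
    """Execute one composed program and its asserts in a fresh namespace."""
    try:
        namespace = {}
        exec(source, namespace)   # define the composed functions
        exec(tests, namespace)    # assert-style tests raise on failure
        queue.put(True)
    except Exception:
        queue.put(False)

def passes_tests(source, tests, timeout=0.04):
    """Run one candidate in a subprocess, enforcing the per-solution timeout."""
    queue = mp.Queue()
    proc = mp.Process(target=_run_candidate, args=(source, tests, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():           # timed out: kill and count as a failure
        proc.terminate()
        proc.join()
        return False
    return (not queue.empty()) and queue.get()

def search_compositions(candidates, tests, budget=100_000):
    """Test combinations of per-function candidates, stopping after `budget`
    composed programs or at the first program that passes the tests."""
    per_function = [candidates[name] for name in candidates]
    combos = itertools.islice(itertools.product(*per_function), budget)
    for combo in combos:
        program = "\n\n".join(combo)
        if passes_tests(program, tests):
            return program
    return None
```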