Parsel🐍: Algorithmic Reasoning with Language Models by Composing Decompositions

Authors: Eric Zelikman, Qian Huang, Gabriel Poesia, Noah Goodman, Nick Haber

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that, using Parsel, LLMs solve more competition-level problems in the APPS dataset, resulting in pass rates over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. Moreover, with automatically generated tests, we find that Parsel can improve the state-of-the-art pass@1 performance on HumanEval from 67% to 85%.
Researcher Affiliation | Academia | Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, Nick Haber; Stanford University; {ezelikman, qhwang, poesia, ngoodman, nhaber}@stanford.edu
Pseudocode | Yes | We visualize the details in Fig. 1 and provide a high-level pseudocode in Fig. A.11.
Open Source Code | Yes | We release our code at https://github.com/ezelikman/parsel.
Open Datasets | Yes | We evaluated Parsel on the competition-level subset of the APPS [27] dataset as follows: [...] We next tested Parsel on HumanEval [12].
Dataset Splits | No | The paper evaluates its method on subsets of the APPS and HumanEval datasets and discusses testing and pass rates, but it does not specify any explicit training/validation/test dataset splits with percentages or sample counts.
Hardware Specification | Yes | The most computationally intensive part of this research, by far (in terms of FLOPS), was the ablation using an open-source CodeGen model, which required several hundred A100 hours.
Software Dependencies | No | The paper mentions using Python, PyTorch, and TensorFlow libraries but does not provide specific version numbers for any of its software dependencies.
Experiment Setup | Yes | We sample everything with temperature=0.6, except the translations, which we sample with temperature=0.2, a presence penalty of 0.1, and a logit bias to prevent it from generating the text "def ", as Codex has a tendency to degenerate to producing Python even when prompted with Parsel examples. We allow at most 500 tokens per function, but in practice found that they typically used less than half of them. For evaluation, we use a timeout of 0.04 seconds per solution and evaluate at most 100,000 implementations per generated Parsel program.
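
The sampling settings quoted in the Experiment Setup row can be pictured as a minimal sketch against a Completions-style API. This is an illustrative assumption rather than the paper's released code: the model identifier, prompts, and the token id biased against "def " are placeholders, and the linked repository handles these details through its own wrappers.

```python
# Illustrative sketch only: mirrors the sampling parameters quoted above
# using the legacy (pre-1.0) openai Python SDK's Completions endpoint.
# The model name, prompts, and DEF_TOKEN_ID are placeholder assumptions.
import openai

DEF_TOKEN_ID = 0  # placeholder: look up the token id for "def " with the model's tokenizer

def sample_translation(prompt: str) -> str:
    """Translation step: temperature 0.2, presence penalty 0.1, bias against 'def '."""
    response = openai.Completion.create(
        model="code-davinci-002",              # Codex-style model (assumption)
        prompt=prompt,
        temperature=0.2,
        presence_penalty=0.1,
        logit_bias={str(DEF_TOKEN_ID): -100},  # effectively bans the biased token
        max_tokens=500,                        # at most 500 tokens per function
    )
    return response["choices"][0]["text"]

def sample_other(prompt: str) -> str:
    """All other sampling steps use temperature 0.6."""
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        temperature=0.6,
        max_tokens=500,
    )
    return response["choices"][0]["text"]
```

Note that the 0.04-second timeout and the cap of 100,000 implementations apply to the downstream evaluation of generated programs, not to these sampling calls.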