Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation
Authors: Jingchang Chen, Hongxuan Tang, Zheng Chu, Qianglong Chen, Zekun Wang, Ming Liu, Bing Qin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on code generation benchmarks (Chen et al., 2021; Austin et al., 2021; Khan et al., 2023) with GPT-3.5 (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023), outperforming state-of-the-art methods by +9.8% on average. Experiments are further carried out on the mathematical competition benchmark, MATH (Hendrycks et al., 2021b), achieving a +6.0 improvement with GPT-4, indicating that FUNCODER can also generalize to complex reasoning. Our method is observed to be equally effective on open-source models (Meta AI, 2024; Mistral AI, 2024; Pinnaparaju et al., 2024; Rozière et al., 2023; Lozhkov et al., 2024), with an average gain over baseline of +31.5% on HumanEval and +47.7% on MATH. Additional analysis also shows the advantage of both divide-and-conquer and functional consensus. |
| Researcher Affiliation | Academia | Jingchang Chen, Harbin Institute of Technology, jcchen@ir.hit.edu.cn; Hongxuan Tang, Harbin Institute of Technology, jeffswt@outlook.com; Zheng Chu, Harbin Institute of Technology, zchu@ir.hit.edu.cn; Qianglong Chen, Zhejiang University, chenqianglong.ai@gmail.com; Zekun Wang, Harbin Institute of Technology, zkwang@ir.hit.edu.cn; Ming Liu, Harbin Institute of Technology, mliu@ir.hit.edu.cn; Bing Qin, Harbin Institute of Technology, qbin@ir.hit.edu.cn |
| Pseudocode | Yes | Algorithm 1 FUNCODER procedure. Require: entry func f_root = {h_root, d_root, ∅}; Require: large language model LLM. 1: function FUNCODER(f_cur) 2: ▷ Divide 3: f′_cur, {f_i} ← EXTRACT(LLM(f_cur)) 4: for f_i ∈ {f_i} do 5: if b_i is NOTIMPLEMENTED then 6: f′_i ← FUNCODER(f_i) ▷ recursion 7: end if 8: ADDCHILD(f_cur, f′_i) 9: end for 10: ▷ Conquer 11: F_cur ← SAMPLE(LLM(f′_cur, CHILD(f_cur))) 12: f*_cur ← FUNCONSENSUS(F_cur) 13: return f*_cur 14: end function 15: return FUNCODER(f_root) ▷ starts from root |
| Open Source Code | Yes | Our code is made openly available at https://github.com/cometeme/funcoder. |
| Open Datasets | Yes | We choose three benchmarks for code generation evaluation: (a) HumanEval (Chen et al., 2021) includes entry-level coding questions; (b) MBPP (Austin et al., 2021) contains questions of standard library invocation and programming basics; and (c) xCodeEval (Khan et al., 2023) consists of algorithmic challenges sourced from the competitive programming platform Codeforces. ... We conduct an experiment on MATH (Hendrycks et al., 2021b), a competition-level mathematical reasoning benchmark. |
| Dataset Splits | No | The paper mentions "train set" in the context of creating few-shot prompts (e.g., "7-shot demonstrations constructed from the train set"), but it does not specify a distinct validation split or the percentages/counts for such splits. |
| Hardware Specification | Yes | We use the instruct variant of these models and inference on a single A100-80G under BF16 precision with vLLM (Kwon et al., 2023). |
| Software Dependencies | Yes | Weights of community models Llama3 (Meta-Llama-3-8B-Instruct), Codestral (Codestral-22B-v0.1), Stable Code (stable-code-instruct-3b), Code Llama (CodeLlama-34b-Instruct-hf) and StarCoder2 (starcoder2-15b-instruct-v0.1) are downloaded from Hugging Face (Wolf et al., 2019) and served over an OpenAI-like API on a single A100-80G GPU under BF16 precision with vLLM (Kwon et al., 2023). |
| Experiment Setup | Yes | Implementation Details FUNCODER uses a 2-shot prompt in the divide stage and 1-shot for conquering sub-functions. The number of sampled implementations in the functional consensus is set to 11 for code generation tasks. For further implementation details, please refer to Appendix A.1. ... Functional Consensus The functional consensus is applied in the conquer stage. Formally, Consensus@k samples k-1 implementations in the conquer stage, and reuses the one produced in the divide stage, resulting in a set F of k candidate programs. Then we prompt the model with 1-shot (C.4) to generate potential inputs X for the given function and use them to feed and execute the program. |
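The divide-and-conquer loop in Algorithm 1 can be sketched in Python. This is a toy illustration, not the paper's code: `Func` and the `FAKE_LLM` table are hypothetical stand-ins for the paper's function representation and LLM calls, so that only the recursion structure (divide, recurse on unimplemented sub-functions, then conquer) is demonstrated.

```python
# Minimal sketch of the FunCoder divide-and-conquer recursion (Algorithm 1).
# The real system prompts an LLM; FAKE_LLM below is a hand-written stand-in
# that "decomposes" solve() into one unimplemented helper.
from dataclasses import dataclass, field

@dataclass
class Func:
    name: str
    body: str = ""                          # empty body = NotImplemented
    children: list = field(default_factory=list)

# Toy "LLM": maps a function name to (drafted body, extracted sub-functions).
FAKE_LLM = {
    "solve": ("return helper(x) + 1", [Func("helper")]),
    "helper": ("return x * 2", []),
}

def funcoder(f_cur: Func) -> Func:
    # Divide: draft the current function, extracting new sub-functions.
    body, subfunctions = FAKE_LLM[f_cur.name]
    f_cur.body = body
    for f_i in subfunctions:
        if not f_i.body:
            f_i = funcoder(f_i)             # recurse on unimplemented children
        f_cur.children.append(f_i)
    # Conquer: the paper re-samples k implementations here and keeps the one
    # selected by functional consensus; this sketch simply keeps the draft.
    return f_cur

root = funcoder(Func("solve"))
```

After the call, `root` holds the drafted body for `solve` and its recursively completed `helper` child, mirroring how the algorithm bottoms out at leaf functions before conquering their parents.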
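The functional-consensus selection described in the Experiment Setup row (score k candidate programs by how often their outputs agree on shared test inputs, keep the most-agreeing one) can be sketched as follows. The candidate functions and inputs below are toy examples; in the paper both the k implementations and the test inputs are LLM-generated.

```python
# Minimal sketch of functional consensus: pick the candidate whose outputs
# agree most often with the other candidates on a shared set of inputs.
def functional_consensus(candidates, inputs):
    def run(fn):
        results = []
        for x in inputs:
            try:
                results.append(fn(x))
            except Exception:
                results.append(None)        # a crashing run never matches
        return results

    outs = [run(fn) for fn in candidates]

    # Score candidate i by pairwise output agreement with every other one.
    def agreement(i):
        return sum(
            sum(a == b and a is not None for a, b in zip(outs[i], outs[j]))
            for j in range(len(candidates)) if j != i
        )

    return max(range(len(candidates)), key=agreement)

# Toy example: two correct doubling candidates outvote one buggy squarer.
cands = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
best = functional_consensus(cands, inputs=[1, 2, 3])   # index 0 wins
```

Because agreement is measured on program behavior rather than source text, semantically identical implementations (here `x * 2` and `x + x`) reinforce each other even though their code differs.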