$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis

Authors: Zishun Yu, Yunzhe Tao, Liyu Chen, Tao Sun, Hongxia Yang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations demonstrated B-Coder's capability in achieving state-of-the-art performance when compared to policy-based methods. Remarks on r_θ. To further explain the motivation of ranking with r_θ, consider a realistic deployment setting where a fine-tuned model is deployed for end-user applications. Our results in Section 5 follow this evaluation protocol. Table 1: Empirical evaluation on APPS test set.
Researcher Affiliation | Collaboration | Zishun Yu, Department of Computer Science, University of Illinois Chicago, Chicago, IL 60607, zyu32@uic.edu; Yunzhe Tao, Liyu Chen, Tao Sun & Hongxia Yang, ByteDance Inc., Seattle, WA 98004, {yunzhe.tao, liyu.chen1, tao.sun, hx.yang}@bytedance.com
Pseudocode | Yes | A pseudo-algorithm could be found in Appendix A. Algorithm 1: Training Procedure with ϕ and θ-stages. Algorithm 2: Sampling Procedure.
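For orientation, below is a minimal, hypothetical sketch of how the two-stage loop referenced in Appendix A might be organized. Every function name, data layout, and loss form here is a placeholder: the ϕ-stage regression target and the way β_adv and β_ce enter the θ-stage objective are guesses from the coefficient names quoted under Experiment Setup, not the authors' actual algorithm.

```python
# Hypothetical sketch of the two-stage training procedure (Algorithm 1).
# Losses and data formats are placeholders, not the paper's implementation.
import torch
import torch.nn.functional as F

def phi_stage(value_head, loader, optimizer):
    """Pre-train the state-value function V_phi(s), e.g. by regressing returns."""
    for states, returns in loader:
        v = value_head(states).squeeze(-1)      # V_phi(s), one scalar per state
        loss = F.mse_loss(v, returns)           # placeholder objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def theta_stage(q_model, loader, optimizer, beta_adv=0.1, beta_ce=0.5):
    """Fine-tune the token-level Q-function; coefficients mirror Section 5."""
    for tokens, targets, advantages in loader:
        logits = q_model(tokens)                # (batch, seq, vocab) token scores
        ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        logp = logits.log_softmax(-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        adv = -(advantages * logp).mean()       # advantage-weighted term (assumed form)
        loss = beta_adv * adv + beta_ce * ce
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```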
Open Source Code | No | The paper initializes its model with the publicly available CodeRL checkpoint and refers to another project's repository (WizardCoder) for the MBPP input format, but it does not contain an explicit statement of, or link to, its own open-source code for the described methodology.
Open Datasets | Yes | APPS benchmark and baselines. In line with prior RL-based works (Le et al., 2022; Shojaee et al., 2023; Liu et al., 2023), we evaluate B-Coder on the challenging code contests benchmark APPS (Hendrycks et al., 2021). MBPP dataset. MBPP has 974 instances with a 374/90/500 train/val/test split and, in addition, 10 problems reserved for few-shot learning.
Dataset Splits | Yes | APPS benchmark and baselines. In line with prior RL-based works (Le et al., 2022; Shojaee et al., 2023; Liu et al., 2023), we evaluate B-Coder on the challenging code contests benchmark APPS (Hendrycks et al., 2021). It contains 5,000 training and 5,000 testing problems, with three difficulty levels: introductory, interview and competition. MBPP dataset. MBPP has 974 instances with a 374/90/500 train/val/test split and, in addition, 10 problems reserved for few-shot learning.
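Both benchmarks are public; a hedged example of obtaining them with the Hugging Face `datasets` library is shown below. The hub identifiers "codeparrot/apps" and "mbpp" are assumptions for illustration, since the paper does not state how the data was downloaded.

```python
# Hypothetical loading of the two benchmarks; hub identifiers are assumptions.
from datasets import load_dataset

apps = load_dataset("codeparrot/apps")  # 5,000 train / 5,000 test problems,
                                        # tagged introductory/interview/competition
mbpp = load_dataset("mbpp")             # 974 tasks: 374 train / 90 val / 500 test,
                                        # plus 10 problems reserved for few-shot prompts

print({split: len(ds) for split, ds in apps.items()})
print({split: len(ds) for split, ds in mbpp.items()})
```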
Hardware Specification | Yes | ϕ-stage training. In the ϕ-stage, we pre-train the state-value function V_ϕ(s). We conduct our experiment with 4 A100-80G GPUs. θ-stage training. In the θ-stage, we conduct our experiment with 8 A100-80G GPUs.
Software Dependencies | No | The paper mentions using T5 as a base architecture and the AdamW optimizer (Loshchilov & Hutter, 2018), but does not specify software versions for libraries, programming languages, or other key components.
Experiment Setup | Yes | Specifically, we use a batch size of 16 for each GPU and a gradient accumulation step of 4, resulting in a total batch size of 256. For optimizer and scheduler, we use the AdamW optimizer (Loshchilov & Hutter, 2018) with a constant learning rate of 1e-5 and a weight decay of 0.05. In the θ-stage, we use a batch size of 16 for each GPU and a gradient accumulation step of 1, resulting in a total batch size of 128. For optimizer and scheduler, we use AdamW with a peak learning rate of 3e-5, a weight decay of 0.05, and a linear decay scheduler with no warmup. We train θ for 10k gradient steps. We set the ground-truth data ratio ρ_real = 0.5 and the energy-based policy temperature α = 1 (see equation 10) for all experiments. In the θ-stage, we use β_adv = 0.1 and β_ce = 0.5.
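The quoted θ-stage optimizer and scheduler settings can be written down concretely; the sketch below uses PyTorch and `transformers`, with a small Linear module standing in for the fine-tuned T5-based model, which is not reproduced here.

```python
# Minimal sketch of the quoted θ-stage optimizer/scheduler settings.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(512, 512)   # placeholder for the fine-tuned model

# θ-stage: peak LR 3e-5, weight decay 0.05, linear decay with no warmup, 10k steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.05)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=10_000)

# Effective batch sizes implied by the quoted setup:
#   ϕ-stage: 16 per GPU x 4 GPUs x 4 accumulation steps = 256
#   θ-stage: 16 per GPU x 8 GPUs x 1 accumulation step  = 128
assert 16 * 4 * 4 == 256 and 16 * 8 * 1 == 128
```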