CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
Authors: Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven Chu Hong Hoi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark. Our comprehensive experiments show that our models can achieve SOTA performance on the challenging APPS benchmark [Hendrycks et al., 2021]. |
| Researcher Affiliation | Industry | Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi (Salesforce Research) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Salesforce Research: https://github.com/salesforce/CodeRL |
| Open Datasets | Yes | We choose the challenging APPS program synthesis benchmark [Hendrycks et al., 2021], as it has large coding problems of varying difficulties collected from multiple coding websites. Finally, we test the zero-shot transfer ability of CodeRL on another smaller and simpler program synthesis benchmark, MBPP [Austin et al., 2021]. We enlarge the Python pretraining dataset using the recently released large-scale GitHub Code dataset (https://huggingface.co/datasets/lvwerra/github-code); see the loading sketch after the table. |
| Dataset Splits | No | The paper mentions 'training and test splits' for the APPS benchmark but does not explicitly state details of a validation split or how it was used. |
| Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the configurations in the Appendix. For training, we used 16 NVIDIA A100 80GB GPUs. For inference, we use 4 NVIDIA V100 32GB GPUs. (from Appendix D.1, Experimental Details) |
| Software Dependencies | No | The paper mentions various software components and models (e.g., CodeT5, GPT models, Python) but does not provide specific version numbers for them. |
| Experiment Setup | Yes | Finetuning Setup...we applied imitation learning to first warm-start a pretrained LM model with L_ce only for up to 10 epochs...After training the critic, we then apply both L_ce and L_rl with equal weights to finetune the actor network. We use nucleus sampling with a batch size of N = 200. See the loss sketch after the table. |
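
As a companion to the Open Datasets row, here is a minimal sketch of how the extra Python pretraining corpus could be pulled from the Hugging Face Hub. It assumes the `datasets` library; the `languages` filter and the record fields (`code`, `repo_name`) follow the dataset card and are not taken from the paper's own code.

```python
# Hedged sketch (not from the paper): stream the GitHub Code dataset used to
# enlarge the Python pretraining data. Assumes the Hugging Face `datasets`
# library; the `languages` filter and field names follow the dataset card.
from datasets import load_dataset

github_code = load_dataset(
    "lvwerra/github-code",   # https://huggingface.co/datasets/lvwerra/github-code
    split="train",
    streaming=True,          # the corpus is large, so stream rather than download
    languages=["Python"],    # keep only Python files, as in the pretraining setup
)

# Peek at a few records to confirm the filter and the fields relied on above.
for i, example in enumerate(github_code):
    print(example["repo_name"], "-", len(example["code"]), "characters of Python")
    if i >= 2:
        break
```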
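
The Experiment Setup row describes the actor finetuning objective: a standard cross-entropy loss L_ce combined, with equal weight, with a policy-gradient loss L_rl whose returns come from unit-test rewards scaled by the critic. The PyTorch-style sketch below illustrates that combination only; tensor shapes, the critic, and the reward shaping are simplified stand-ins, not the authors' implementation.

```python
# Hedged sketch of the combined actor loss described in the Experiment Setup row.
# The critic and the unit-test reward are abstracted into a precomputed `returns`
# tensor; this is an illustration, not the released CodeRL code.
import torch.nn.functional as F

def actor_loss(logits, target_ids, sampled_ids, sampled_logits, returns,
               pad_id=0, rl_weight=1.0):
    """logits/target_ids: teacher-forced pass on ground-truth programs.
    sampled_ids/sampled_logits: a program sampled from the actor.
    returns: per-token return estimate (unit-test reward scaled by a critic)."""
    # L_ce: ordinary token-level cross entropy on the ground-truth program.
    l_ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,
    )

    # L_rl: REINFORCE-style loss, negative log-probability of the sampled tokens
    # weighted by the return estimate for the sampled program.
    log_probs = F.log_softmax(sampled_logits, dim=-1)
    sampled_logp = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    mask = (sampled_ids != pad_id).float()
    l_rl = -(returns * sampled_logp * mask).sum() / mask.sum()

    # Equal weighting of the two terms, as stated in the finetuning setup.
    return l_ce + rl_weight * l_rl
```

In the paper's pipeline, the returns would come from executing sampled programs against the example unit tests and rescaling the pass/fail reward with the token-level critic, and candidate programs are drawn with nucleus sampling at batch size N = 200.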