SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Authors: Yangruibo Ding, Jinjun Peng, Marcus Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose training Code LLMs not only to write code but also to understand code semantics by reasoning about key properties, constraints, and execution behaviors using natural language, mimicking human verbal debugging, i.e., rubber-duck debugging. This approach led to the development of SEMCODER, a Code LLM with only 6.7B parameters, which shows competitive performance with GPT-3.5-turbo on code generation and execution reasoning tasks. SEMCODER achieves 79.3% on HumanEval (GPT-3.5-turbo: 76.8%), 63.6% on CRUXEval-I (GPT-3.5-turbo: 50.3%), and 63.9% on CRUXEval-O (GPT-3.5-turbo: 59.0%).
Researcher Affiliation | Academia | Yangruibo Ding (Columbia University, yrbding@cs.columbia.edu); Jinjun Peng (Columbia University, jinjun@cs.columbia.edu); Marcus J. Min (Columbia University, jm5025@columbia.edu); Gail Kaiser (Columbia University, kaiser@cs.columbia.edu); Junfeng Yang (Columbia University, junfeng@cs.columbia.edu); Baishakhi Ray (Columbia University, rayb@cs.columbia.edu)
Pseudocode | No | The paper describes the 'monologue reasoning' strategy and various processes in text, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our data, code, and models are available at: https://github.com/ARiSE-Lab/SemCoder.
Open Datasets | Yes | Our data, code, and models are available at: https://github.com/ARiSE-Lab/SemCoder. We collect PYX, a synthetic dataset capturing comprehensive program semantics with executable code samples and unit tests.
Dataset Splits | No | The paper specifies external evaluation datasets (e.g., HumanEval, CRUXEval) for testing, but it does not explicitly detail training/validation/test splits for its internally curated PYX dataset used for training SEMCODER.
Hardware Specification | Yes | All SEMCODER variants are trained for 2 epochs on a server with eight NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions using a 'Python interpreter' and models like 'GPT-3.5-turbo' and 'DeepSeek-Coder', but it does not provide specific version numbers for software dependencies or libraries (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | All SEMCODER variants are trained for 2 epochs on a server with eight NVIDIA RTX A6000 GPUs, using a learning rate of 5e-5 with a cosine decay to 5e-6 during the program semantics training. For self-refinement fine-tuning, SEMCODER and baseline Code LLMs are trained for 2 epochs with a learning rate of 1e-5. We use a batch size of 512 and a maximum context length of 2,048.
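
The Experiment Setup and Hardware Specification rows above pin down the headline training hyperparameters: 2 epochs, a learning rate of 5e-5 with cosine decay to a 5e-6 floor for the semantics-training stage, a global batch size of 512, a 2,048-token context, and eight RTX A6000 GPUs. The sketch below is a minimal, hedged reconstruction of that schedule in PyTorch; the base checkpoint name, per-device micro-batch size, gradient-accumulation split, and dataset size are assumptions chosen only to make the stated numbers fit, not details confirmed by the paper.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from transformers import AutoModelForCausalLM

# Only epochs, peak/floor learning rate, global batch size, and context length
# are stated in the paper; everything else below is an illustrative assumption.
BASE_MODEL = "deepseek-ai/deepseek-coder-6.7b-base"  # assumed base checkpoint
EPOCHS = 2
LR_MAX, LR_MIN = 5e-5, 5e-6           # stated cosine decay from 5e-5 to 5e-6
GLOBAL_BATCH = 512                    # stated effective batch size
MAX_LEN = 2048                        # stated maximum context length
NUM_GPUS, PER_DEVICE_BATCH = 8, 8     # assumed micro-batching over 8 GPUs
GRAD_ACCUM = GLOBAL_BATCH // (NUM_GPUS * PER_DEVICE_BATCH)  # = 8 (assumed)
NUM_TRAIN_SAMPLES = 100_000           # placeholder; depends on the PYX subset used

steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / GLOBAL_BATCH)
total_steps = EPOCHS * steps_per_epoch

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
optimizer = AdamW(model.parameters(), lr=LR_MAX)

def cosine_to_floor(step: int) -> float:
    """LR multiplier implementing cosine decay from LR_MAX down to LR_MIN (not zero)."""
    progress = min(step / max(total_steps, 1), 1.0)
    lr = LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))
    return lr / LR_MAX

scheduler = LambdaLR(optimizer, lr_lambda=cosine_to_floor)

# The training loop (data parallelism over 8 GPUs, gradient accumulation of
# GRAD_ACCUM micro-batches, sequences truncated to MAX_LEN tokens) is omitted.
# For the self-refinement stage the paper states 2 epochs at 1e-5 but does not
# specify a schedule, so that stage is not reconstructed here.
```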
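For readers unfamiliar with the execution-reasoning benchmarks cited in the Research Type row, CRUXEval-I asks a model to propose an input that produces a given output, and CRUXEval-O asks it to predict the output for a given input, both over short Python functions. The snippet below is a hand-written illustration in that style, not an actual benchmark item.

```python
def f(s: str) -> str:
    """Toy function in the style of a CRUXEval subject program."""
    return s + "a" * (len(s) % 3)

# CRUXEval-O style task: given f and the input "xy", predict the output.
# Reasoning: len("xy") % 3 == 2, so two "a" characters are appended.
assert f("xy") == "xyaa"

# CRUXEval-I style task: given f and the output "aba", propose an input.
# Reasoning: len("aba") % 3 == 0, so f returns "aba" unchanged.
assert f("aba") == "aba"
```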