Multi-lingual Evaluation of Code Generation Models

Authors: Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present two new benchmarks, MBXP and Multilingual HumanEval, designed to evaluate code generation models in over 10 programming languages. These datasets are generated using a conversion framework that transpiles prompts and test cases from the original MBPP and HumanEval datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and we uncover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual over mono-lingual models, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities. (A hedged sketch of this prompt/test-case conversion appears after the table.)
Researcher Affiliation | Industry | Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang. AWS AI Labs. Corresponding authors: {benathi,skgouda,zijwan,xiaopel,tiayuche,bxiang}@amazon.com
Pseudocode | No | The paper describes its methods and frameworks in narrative text and uses figures to illustrate processes (e.g., Figure 1 for benchmark construction, Figure 2 for the conversion of formatted MBPP), but it does not contain any formal pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | We release the data and evaluation code at https://github.com/amazon-research/mbxp-exec-eval.
Open Datasets | Yes | The result of this conversion is two benchmarks, MBXP and Multilingual HumanEval, which are derived from the original Python datasets MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021). We also release a code package to perform execution in all supported languages. In addition, our conversion framework is easily extensible and allows us to obtain multi-lingual versions of other existing datasets such as MathQA (Schubotz et al., 2019).
Dataset Splits | Yes | We randomly split 0.1% of the data as a validation set.
Hardware Specification | No | The paper discusses training details such as the number of training tokens, model architecture, optimizer settings, and the use of bfloat16 and DeepSpeed for training optimization. However, it does not specify any particular hardware, such as GPU models (e.g., NVIDIA A100), CPU types, or details of the computing infrastructure used for the experiments.
Software Dependencies | No | Our training pipeline is based on PyTorch Lightning, and we use bfloat16 (Kalamkar et al., 2019) and DeepSpeed (Rasley et al., 2020) for training optimization. We adapted the human-eval repository by OpenAI, which provides a multi-threaded execution-based evaluation framework in Python along with unbiased pass@k calculation. (A sketch of the pass@k estimator appears after the table.)
Experiment Setup | Yes | We use nucleus sampling with p = 0.95 (Holtzman et al., 2020). For all experiments, we limit the input length to 1792 tokens and generate up to 256 tokens. We train on 210B tokens for mono-lingual models and 630B tokens for multi-lingual models, with 210B tokens from each language. Across all models, we use a maximum sequence length of 2048 and a larger batch size for larger models, reducing the maximum number of steps accordingly so that all models are trained on the same amount of per-language tokens. For example, for the 13B model, we use a batch size of 1024 and 100,000 maximum steps with a 2048 sequence length... We use the AdamW optimizer (Loshchilov & Hutter, 2018) with β1 = 0.9, β2 = 0.95, and ϵ = 10^-8. We use 2000 warm-up steps with cosine annealing after the peak learning rate, a minimum learning rate of 10% of the corresponding peak learning rate, weight decay of 0.01, and gradient clipping of 1.0. (An illustrative optimizer and scheduler configuration appears after the table.)
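
To make the conversion framework described in the Research Type row concrete, the following is a minimal sketch of how an MBPP-style Python signature and assert-based test cases could be transpiled into a target-language prompt and test harness (here JavaScript text emitted from Python). The function name, target language, and naive string-level conversion are illustrative assumptions, not the authors' implementation, which handles types, naming conventions, and many languages.

    import re

    def convert_prompt_and_tests(py_signature, py_asserts):
        """Hypothetical sketch of MBPP-style prompt/test conversion.

        Takes a Python signature such as "def add(a, b):" and a list of
        "assert f(...) == expected" test cases, and emits (1) a JavaScript
        function header for the model to complete and (2) a JavaScript test
        harness. Only literals valid in both languages (numbers, strings,
        flat lists) survive this naive text-level conversion.
        """
        match = re.match(r"def\s+(\w+)\s*\((.*)\)\s*:", py_signature.strip())
        name, args = match.group(1), match.group(2)

        # Target-language prompt: the header the model is asked to complete.
        prompt = f"function {name}({args}) {{\n"

        # Turn each "assert call == expected" into a check that throws on failure.
        checks = []
        for a in py_asserts:
            call, expected = a.replace("assert", "", 1).split("==")
            checks.append(
                f"    if (JSON.stringify({call.strip()}) !== JSON.stringify({expected.strip()}))"
                " throw new Error('test failed');"
            )
        harness = "function test() {\n" + "\n".join(checks) + "\n}\ntest();"
        return prompt, harness

    # Example: convert one MBPP-style problem.
    prompt, harness = convert_prompt_and_tests(
        "def add_numbers(a, b):",
        ["assert add_numbers(2, 3) == 5", "assert add_numbers(-1, 1) == 0"],
    )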
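
The Software Dependencies row notes that evaluation adapts OpenAI's human-eval repository and its unbiased pass@k calculation. The estimator below follows the formula from Chen et al. (2021); the function name and the use of NumPy are a sketch rather than the repository's exact code.

    import numpy as np

    def pass_at_k(n, c, k):
        """Unbiased pass@k estimator from Chen et al. (2021).

        n: generated samples per problem, c: samples passing all tests,
        k: evaluation budget. Returns the probability that at least one of
        k samples drawn without replacement from the n generations is correct:
        1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
        """
        if n - c < k:
            return 1.0  # every size-k subset must contain a correct sample
        return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 200 samples, 37 of which pass, gives a pass@1 estimate of 0.185.
    print(pass_at_k(n=200, c=37, k=1))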
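
The Experiment Setup row reports AdamW with β1 = 0.9, β2 = 0.95, ϵ = 10^-8, weight decay of 0.01, 2000 warm-up steps, cosine annealing to 10% of the peak learning rate, and gradient clipping of 1.0. Below is a PyTorch sketch of that configuration; the function name, the linear warm-up shape, and the overall structure are assumptions, since the paper only states the hyperparameters.

    import math
    import torch

    def build_optimizer_and_scheduler(model, peak_lr, max_steps,
                                      warmup_steps=2000, min_lr_ratio=0.1,
                                      weight_decay=0.01):
        """Illustrative optimizer/scheduler matching the reported settings;
        not the authors' training code."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                      betas=(0.9, 0.95), eps=1e-8,
                                      weight_decay=weight_decay)

        def lr_lambda(step):
            if step < warmup_steps:                      # linear warm-up to peak
                return step / max(1, warmup_steps)
            progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
            cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
            return min_lr_ratio + (1.0 - min_lr_ratio) * cosine  # decay to 10% of peak

        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
        return optimizer, scheduler

    # Per training step, with the reported gradient clipping of 1.0:
    # loss.backward()
    # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    # optimizer.step(); scheduler.step(); optimizer.zero_grad()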