Measuring the Impact of Programming Language Distribution
Authors: Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. ... Training a model on a balanced corpus results in, on average, 12.34% higher pass@k across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better pass@k on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. (See the pass@k sketch after this table.) |
| Researcher Affiliation | Collaboration | Gabriel Orlanski¹²*, Kefan Xiao², Xavier Garcia³, Jeffrey Hui², Joshua Howland², Jonathan Malmaud², Jacob Austin³, Rishabh Singh²*, Michele Catasta²*. *Work done while at Google. ¹Department of Computer Science, New York University, New York, New York; ²Google Labs; ³Google Brain. Correspondence to: Gabriel Orlanski <go533@nyu.edu>, Kefan Xiao <kfxiao@google.com>, Xavier Garcia <xgarcia@google.com>. |
| Pseudocode | No | The paper describes its framework design and functions but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | BabelCode is open-sourced, has an extensive test suite, and supports evaluating four benchmarks in 14 languages. ... https://github.com/google-research/babelcode |
| Open Datasets | Yes | We additionally introduce a new dataset called Translating Python Programming Puzzles (TP3). We take the verification functions from the questions in the original Python Programming Puzzles dataset (Schuster et al., 2021) to create this dataset. |
| Dataset Splits | No | The paper describes the training data collection and sampling strategies (natural and Unimax distributions) but does not provide explicit train/validation/test splits (e.g., percentages or counts) for the main training corpus used for the LLMs. The benchmarks are used for evaluation/testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models; it only mentions that models are trained and references the general 'UL2' objective, which might imply certain hardware but does not explicitly state it. |
| Software Dependencies | No | The paper mentions software frameworks such as T5X, SeqIO, and SentencePiece, and the Adafactor optimizer, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Every model has a context window of 2048 and is trained identically with the same vocabulary described in subsection 4.3. We use a base learning rate of 0.01 and a constant warmup with a step inverse decay. The number of warmup steps is kept to 10% of the total training steps per model. The total number of training steps is 38000, 77000, 190000 for the 1B, 2B, and 4B models, respectively. We use the Adafactor optimizer (Shazeer & Stern, 2018) and a batch size of 256. For every dataset, we use T = 0.8, top-p = 0.95, and do not use top-k. (A hedged sketch of this setup follows the table.) |
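
The pass@k figures quoted in the Research Type row measure the probability that at least one of k sampled programs for a task passes all of its test cases. Below is a minimal sketch of the standard unbiased estimator commonly used for this metric, computed from n generated samples of which c pass; the function name is illustrative, and the paper's exact estimator is not reproduced in the quotes above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c pass all tests."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing program.
        return 1.0
    # 1 - P(a uniformly chosen size-k subset contains no passing program)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per task, 15 of them pass -> pass@10 estimate.
print(pass_at_k(n=200, c=15, k=10))
```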
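
The Experiment Setup row quotes a base learning rate of 0.01, a constant warmup covering the first 10% of training steps, a "step inverse decay," and sampling with temperature 0.8 and top-p 0.95. The following is a minimal sketch under stated assumptions: the decay is read as inverse-square-root and the names are illustrative, since the paper's schedule implementation is not quoted.

```python
# Decoding parameters as quoted: temperature 0.8, nucleus (top-p) 0.95, no top-k cutoff.
SAMPLING = {"temperature": 0.8, "top_p": 0.95, "top_k": None}


def learning_rate(step: int, total_steps: int, base_lr: float = 0.01) -> float:
    """Constant warmup for the first 10% of steps, then inverse-square-root decay.

    One plausible reading of 'constant warmup with a step inverse decay';
    the exact schedule used in the paper is not reproduced here.
    """
    warmup_steps = max(1, int(0.1 * total_steps))
    if step <= warmup_steps:
        return base_lr
    return base_lr * (warmup_steps / step) ** 0.5


# Example: the 1B model trains for 38,000 steps, so warmup covers the first 3,800.
print(learning_rate(step=20_000, total_steps=38_000))
```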