Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Measuring the Impact of Programming Language Distribution
Authors: Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. ... Training a model on a balanced corpus results in, on average, 12.34% higher pass@k across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better pass@k on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. |
| Researcher Affiliation | Collaboration | Gabriel Orlanski 1,2,*, Kefan Xiao 2, Xavier Garcia 3, Jeffrey Hui 2, Joshua Howland 2, Jonathan Malmaud 2, Jacob Austin 3, Rishabh Singh 2,*, Michele Catasta 2,*. *Work done while at Google. 1Department of Computer Science, New York University, New York, New York; 2Google Labs; 3Google Brain. Correspondence to: Gabriel Orlanski <EMAIL>, Kefan Xiao <EMAIL>, Xavier Garcia <EMAIL>. |
| Pseudocode | No | The paper describes its framework design and functions but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | BabelCode is open-sourced, has an extensive test suite, and supports evaluating four benchmarks in 14 languages. ... https://github.com/google-research/babelcode |
| Open Datasets | Yes | We additionally introduce a new dataset called Translating Python Programming Puzzles (TP3). We take the verification functions from the questions in the original Python Programming Puzzles dataset (Schuster et al., 2021) to create this dataset. |
| Dataset Splits | No | The paper describes the training data collection and sampling strategies (natural and Unimax distributions) but does not provide explicit train/validation/test splits (e.g., percentages or counts) for the main training corpus used for the LLMs. The benchmarks are used for evaluation/testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models. It only mentions that models are trained and references the general 'UL2' objective, which might imply certain hardware but does not explicitly state it. |
| Software Dependencies | No | The paper mentions software frameworks like T5X, SeqIO, and SentencePiece, and an optimizer called Adafactor, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Every model has a context window of 2048 and is trained identically with the same vocabulary described in subsection 4.3. We use a base learning rate of 0.01 and a constant warmup with a step inverse decay. The number of warmup steps is kept to 10% of the total training steps per model. The total number of training steps is 38000, 77000, 190000 for the 1B, 2B, and 4B models, respectively. We use the Adafactor optimizer (Shazeer & Stern, 2018) and a batch size of 256. For every dataset, we use T = 0.8, top-p = 0.95, and do not use top-k. |
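The learning-rate schedule quoted above ("constant warmup with a step inverse decay", warmup = 10% of total steps) can be sketched as follows. The paper does not give the exact decay formula, so this is a minimal sketch assuming a constant rate during warmup followed by a 1/step inverse decay that is continuous at the warmup boundary; the function name and signature are illustrative, not from the paper.

```python
def lr_schedule(step: int, total_steps: int, base_lr: float = 0.01) -> float:
    """Assumed schedule: constant base_lr during warmup (10% of total
    steps, per the paper), then inverse (1/step) decay afterwards.
    The exact decay form is an assumption; the paper only says
    'constant warmup with a step inverse decay'."""
    warmup_steps = int(0.1 * total_steps)
    if step <= warmup_steps:
        return base_lr
    # Scale by warmup_steps/step so the curve is continuous at the boundary.
    return base_lr * warmup_steps / step


# Example with the 1B model's reported 38000 total steps:
# warmup covers steps 1..3800 at lr = 0.01, then decays as 0.01 * 3800 / step.
print(lr_schedule(1000, 38000))   # inside warmup
print(lr_schedule(38000, 38000))  # final step
```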