Measuring the Impact of Programming Language Distribution

Authors: Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. ... Training a model on a balanced corpus results in, on average, 12.34% higher pass@k across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better pass@k on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. (A pass@k estimator sketch follows the table.)
Researcher Affiliation | Collaboration | Gabriel Orlanski (1, 2, *), Kefan Xiao (2), Xavier Garcia (3), Jeffrey Hui (2), Joshua Howland (2), Jonathan Malmaud (2), Jacob Austin (3), Rishabh Singh (2, *), Michele Catasta (2, *). *Work done while at Google. 1 Department of Computer Science, New York University, New York, New York; 2 Google Labs; 3 Google Brain. Correspondence to: Gabriel Orlanski <go533@nyu.edu>, Kefan Xiao <kfxiao@google.com>, Xavier Garcia <xgarcia@google.com>.
Pseudocode | No | The paper describes its framework design and functions but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | BabelCode is open-sourced, has an extensive test suite, and supports evaluating four benchmarks in 14 languages. ... https://github.com/google-research/babelcode
Open Datasets | Yes | We additionally introduce a new dataset called Translating Python Programming Puzzles (TP3). We take the verification functions from the questions in the original Python Programming Puzzles dataset (Schuster et al., 2021) to create this dataset. (An illustrative verification function appears after the table.)
Dataset Splits | No | The paper describes the training data collection and sampling strategies (natural and UniMax distributions) but does not provide explicit train/validation/test splits (e.g., percentages or counts) for the main training corpus used for the LLMs; the benchmarks are used only for evaluation/testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models; it only mentions that the models are trained and references the UL2 objective, which does not identify the hardware used.
Software Dependencies | No | The paper mentions software frameworks such as T5X, SeqIO, and SentencePiece, and the Adafactor optimizer, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | Every model has a context window of 2048 and is trained identically with the same vocabulary described in subsection 4.3. We use a base learning rate of 0.01 and a constant warmup with a step inverse decay. The number of warmup steps is kept to 10% of the total training steps per model. The total number of training steps is 38000, 77000, 190000 for the 1B, 2B, and 4B models, respectively. We use the Adafactor optimizer (Shazeer & Stern, 2018) and a batch size of 256. For every dataset, we use T = 0.8, top-p = 0.95, and do not use top-k. (These hyperparameters are restated in a sketch after the table.)
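
The Research Type row above reports pass@k gains for the balanced-corpus models. The excerpts here do not spell out how pass@k is computed; in this line of work it is commonly the unbiased estimator of Chen et al. (2021) over execution outcomes, and the minimal sketch below assumes exactly that. The function name pass_at_k and the example counts are illustrative, not taken from the paper.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for a problem
    c: samples that passed all test cases when executed
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Example: 200 samples per problem, 17 pass execution -> estimate pass@10
print(round(pass_at_k(n=200, c=17, k=10), 4))
```

Per-language pass@k figures like those quoted above are then averages of this estimate over the problems in each benchmark.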
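
The Open Datasets row notes that TP3 is built from the verification functions of Python Programming Puzzles (Schuster et al., 2021), where each puzzle is a Python function that returns True only for a correct solution. The puzzle below is a made-up illustration of that format, not an item from P3 or TP3.

```python
# Illustrative P3-style verification function (hypothetical puzzle, not from the dataset).
# A "solution" is any input that makes the verifier return True; TP3 asks models to
# translate verification functions like this one into other programming languages.
def sat(s: str) -> bool:
    """True iff `s` has length 10 and contains exactly three 'a' characters."""
    return len(s) == 10 and s.count("a") == 3


# Checking a candidate solution is just a matter of executing the verifier.
assert sat("aaabbbbbbb")
```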
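
The Experiment Setup row lists the reported hyperparameters. The sketch below only restates them in one place and adds one plausible reading of "a constant warmup with a step inverse decay" (constant learning rate during warmup, then decay proportional to 1/step); the names, structure, and exact decay form are assumptions rather than the authors' code.

```python
BASE_LR = 0.01
TOTAL_STEPS = {"1B": 38_000, "2B": 77_000, "4B": 190_000}

TRAIN_CONFIG = {
    "context_window": 2048,
    "optimizer": "Adafactor",   # Shazeer & Stern (2018)
    "batch_size": 256,
    "warmup_fraction": 0.10,    # warmup steps = 10% of total steps per model
}

SAMPLING_CONFIG = {"temperature": 0.8, "top_p": 0.95, "top_k": None}


def learning_rate(step: int, total_steps: int) -> float:
    """Constant warmup followed by 1/step decay (assumed interpretation)."""
    warmup_steps = int(TRAIN_CONFIG["warmup_fraction"] * total_steps)
    if step <= warmup_steps:
        return BASE_LR
    return BASE_LR * warmup_steps / step


# Example: learning rate halfway through training the 1B model.
print(learning_rate(step=19_000, total_steps=TOTAL_STEPS["1B"]))
```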