Language models are multilingual chain-of-thought reasoners
Authors: Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp. |
| Researcher Affiliation | Collaboration | Freda Shi (1,2), Mirac Suzgun (1,3), Markus Freitag (1), Xuezhi Wang (1), Suraj Srivats (4), Soroush Vosoughi (4), Hyung Won Chung (1), Yi Tay (1), Sebastian Ruder (1), Denny Zhou (1), Dipanjan Das (1), Jason Wei (1). Affiliations: (1) Google Research; (2) Toyota Technological Institute at Chicago; (3) Stanford University; (4) Dartmouth College. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp. (This link is for the benchmark, not the source code of the methodology itself.) |
| Open Datasets | Yes | We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp. (A minimal data-loading sketch appears after this table.) |
| Dataset Splits | Yes | For each target language, XCOPA contains 100 annotated examples in the validation set and 500 examples in the test set. In our experiments, we focus on the examples in the test sets and use the ones in the validation set as few-shot exemplars whenever needed. (See the exemplar-selection sketch after this table.) |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper does not provide specific software dependencies or their version numbers required to replicate the experiments. |
| Experiment Setup | Yes | We provide an overview of standard prompting and chain-of-thought prompting, as well as their extensions to the multilingual setting, which we illustrate in Table 1 and use in our experiments (§4). (...) Throughout this paper, we generate outputs using greedy decoding (i.e., sampling with temperature τ = 0). (...) Effect of exemplar amount. We analyze how the multilingual reasoning performance of PaLM-540B, the overall best-performing model, is affected by the number of few-shot exemplars (Figure 5). (A prompting sketch that mirrors this setup appears after this table.) |
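
The MGSM benchmark in the url-nlp repository is distributed as per-language data files. The sketch below shows one way to fetch and parse a language split; the raw-file path and the two-column (question, numeric answer) TSV layout are assumptions about the repository, not details stated in the paper.

```python
import csv
import urllib.request

# Assumed raw-file path inside google-research/url-nlp; the exact
# repository layout may differ.
MGSM_SW_URL = ("https://raw.githubusercontent.com/google-research/url-nlp/"
               "main/mgsm/mgsm_sw.tsv")  # Swahili split

def load_mgsm(url: str) -> list[dict]:
    """Download one MGSM language file, assumed to be tab-separated with
    a question column followed by the numeric answer."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    reader = csv.reader(text.splitlines(), delimiter="\t")
    return [{"question": row[0], "answer": row[1]} for row in reader]

problems = load_mgsm(MGSM_SW_URL)
print(len(problems))  # expected: 250 manually translated problems
print(problems[0]["question"])
```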
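
For the XCOPA splits described in the Dataset Splits row, a convenient route is the copy hosted on the Hugging Face Hub. The dataset identifier, language config codes, and field names below are assumptions based on that public listing rather than anything stated in the paper; the split sizes match the quoted ones.

```python
from datasets import load_dataset  # Hugging Face `datasets` library

# Assumed Hub identifier and language config ("sw" = Swahili).
xcopa_sw = load_dataset("xcopa", "sw")

validation = xcopa_sw["validation"]  # 100 annotated examples per language
test = xcopa_sw["test"]              # 500 examples per language

# Following the paper's protocol: evaluate on the test split and draw
# few-shot exemplars from the validation split whenever needed.
exemplars = validation.select(range(4))  # e.g., four exemplars for a 4-shot prompt
for ex in exemplars:
    print(ex["premise"])  # field name assumed from the Hub listing
```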
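
The Experiment Setup row quotes greedy decoding (temperature τ = 0) and few-shot chain-of-thought prompting. Below is a minimal sketch of how such a prompt can be assembled; `model_generate` is a hypothetical placeholder, since PaLM-540B is not a publicly callable API, and the exemplar contents and field names are purely illustrative.

```python
# Minimal few-shot chain-of-thought prompt construction. The exemplar
# dicts and the `model_generate` stub are illustrative placeholders.

def build_cot_prompt(exemplars: list[dict], question: str,
                     answer_prefix: str = "Step-by-Step Answer:") -> str:
    """Concatenate solved exemplars (question plus worked rationale) and
    append the new question with an empty rationale slot for the model."""
    blocks = [
        f"Question: {ex['question']}\n{answer_prefix} {ex['rationale']}"
        for ex in exemplars
    ]
    blocks.append(f"Question: {question}\n{answer_prefix}")
    return "\n\n".join(blocks)

def model_generate(prompt: str, temperature: float = 0.0) -> str:
    # Stand-in for a real LLM call; greedy decoding corresponds to
    # sampling with temperature 0, matching the paper's setup.
    raise NotImplementedError("Replace with an actual model API call.")

prompt = build_cot_prompt(
    [{"question": "Roger has 5 tennis balls. He buys 2 more cans of 3 "
                  "tennis balls each. How many tennis balls does he have now?",
      "rationale": "Roger started with 5 balls. 2 cans of 3 is 6 balls. "
                   "5 + 6 = 11. The answer is 11."}],
    "A bakery sells 4 boxes of 6 muffins. How many muffins is that?",
)
print(prompt)
```

In the multilingual variants studied in the paper, the same template is filled with exemplars and questions in the target or pivot language; only the prompt contents change, not the construction logic.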