Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Authors: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, Stanford University, California, USA; (2) Google DeepMind, California, USA; (3) Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA.
Pseudocode | Yes | The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). (A minimal illustrative sketch of this execute-or-simulate loop follows the table.)
Open Source Code | No | The paper provides a project website (https://chain-of-code.github.io/) but does not explicitly state that the methodology's source code is released, nor does it link to a code repository.
Open Datasets | Yes | We consider a subset of challenging tasks from BIG-Bench (Srivastava et al., 2022) called BIG-Bench Hard (BBH) (Suzgun et al., 2022)... We also show results for the grade-school math (GSM8K) benchmark (Cobbe et al., 2021) in Section A.2.
Dataset Splits | No | The paper mentions evaluating with few-shot prompting, using either 'examples from the same problem family' or 'examples of different problems' as context, but does not provide explicit training/validation/test splits (as percentages or sample counts) needed for reproduction.
Hardware Specification | No | The paper mentions the language models used (e.g., text-davinci-003, PaLM-2, gpt-3.5-turbo, gpt-4) but does not provide hardware details such as GPU models, CPU specifications, or memory used to run the experiments.
Software Dependencies | No | The paper states that the implementation uses Python as the code interpreter but does not specify a Python version or any other software dependencies with version numbers.
Experiment Setup | Yes | These tasks are evaluated with few-shot prompting, whereby three examples from the same problem family are provided as context. We also introduce a new evaluation setting, cross-task prompting, whereby three examples of different problems are provided as context.
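
The execute-or-simulate loop quoted in the Pseudocode row can be sketched concretely. The snippet below is a minimal sketch, not the authors' released implementation: `run_chain_of_code`, `query_lm`, and `is_sarcastic` are hypothetical names, and the LM call is stubbed with a dummy string so the example runs end to end. It only illustrates the pattern of executing each line with the Python interpreter and handing undefined behavior off to an LM (the "LMulator") that updates the shared program state.

```python
def query_lm(prompt: str) -> str:
    """Placeholder 'LMulator' call; a real system would query an actual LM API."""
    return "<LM-simulated value>"  # dummy output so the sketch runs end to end


def run_chain_of_code(code_lines, lm=query_lm):
    """Run LM-written code line by line; hand undefined behavior off to the LM."""
    state = {}  # program state shared by the Python interpreter and the LM
    for line in code_lines:
        try:
            exec(line, state)  # try real Python execution first
        except Exception:
            # The interpreter caught undefined behavior, e.g. a semantic helper
            # such as is_sarcastic(text) that has no Python implementation.
            visible = {k: v for k, v in state.items() if not k.startswith("__")}
            target = line.split("=", 1)[0].strip() if "=" in line else None
            prompt = (
                f"Program state: {visible}\n"
                f"Simulate the line below and return the value assigned to "
                f"'{target}':\n{line}"
            )
            if target is not None:
                state[target] = lm(prompt)  # the LM fills in the semantic result
    return state


if __name__ == "__main__":
    program = [
        "text = 'Great, another Monday.'",
        "sarcastic = is_sarcastic(text)",  # undefined -> handed off to the LM
        "answer = 'yes' if sarcastic else 'no'",
    ]
    print(run_chain_of_code(program)["answer"])
```

In this toy run, the first and third lines execute in Python, while the second line raises a NameError and is simulated by the stubbed LM, whose output is written back into the program state before execution continues.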