Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
Authors: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Stanford University, California, USA 2Google DeepMind, California, USA 3Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA. |
| Pseudocode | Yes | The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). A minimal sketch of this interpreter-to-LM handoff appears below the table. |
| Open Source Code | No | The paper provides a project website (https://chain-of-code.github.io/) but contains no explicit statement about releasing the source code for the methodology and no direct link to a code repository. |
| Open Datasets | Yes | We consider a subset of challenging tasks from BIG-Bench (Srivastava et al., 2022) called BIG-Bench Hard (BBH) (Suzgun et al., 2022)... We also show results for the grade-school math (GSM8K) benchmark (Cobbe et al., 2021) in Section A.2 |
| Dataset Splits | No | The paper mentions evaluating with 'few-shot prompting' using 'examples from the same problem family' or 'examples of different problems' as context, but does not provide explicit training/validation/test dataset splits in terms of percentages or sample counts for reproduction. |
| Hardware Specification | No | The paper mentions the use of various language models (e.g., text-davinci-003, PaLM-2, gpt-3.5-turbo, gpt-4) but does not provide specific hardware details such as GPU models, CPU specifications, or memory used for running the experiments. |
| Software Dependencies | No | The paper states that the implementation uses 'Python' as the code interpreter but does not provide a specific version number for Python or any other software dependencies with their versions. |
| Experiment Setup | Yes | These tasks are evaluated with few-shot prompting, whereby three examples from the same problem family are provided as context. We also introduce a new evaluation setting, cross-task prompting, whereby three examples of different problems are provided as context. A sketch of this prompting setup also appears below the table. |
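
The "LMulator" quote above describes the paper's core mechanism: generated code is run line by line, and lines the interpreter cannot execute are simulated by the LM instead. The sketch below is a minimal, hedged illustration of that handoff, not the authors' implementation; `query_lm`, `run_with_lmulator`, the prompt format, and the dict-literal state convention are all assumptions introduced here.

```python
import ast

def query_lm(prompt: str) -> str:
    """Hypothetical stand-in for an LM API call; not from the paper."""
    raise NotImplementedError("plug in a real LM client here")

def run_with_lmulator(code_lines, state=None):
    """Execute code line by line; delegate failing lines to the LM."""
    state = dict(state or {})
    for line in code_lines:
        try:
            # Lines Python can handle run on the real interpreter,
            # updating the shared program state.
            exec(line, {}, state)
        except Exception:
            # Undefined behavior (e.g. a semantic helper with no
            # implementation): ask the LM to simulate the line's effect
            # and return the updated state as a Python dict literal.
            prompt = (
                f"Program state: {state}\n"
                f"Simulate this line and return the updated state "
                f"as a Python dict:\n{line}"
            )
            state = ast.literal_eval(query_lm(prompt))
    return state

# Example program: the first line runs in Python; the second calls an
# undefined semantic helper, so it would be handed off to the LM.
program = [
    "sentence = 'The quick brown fox jumps over the lazy dog'",
    "num_animal_words = count_animal_words(sentence)",
]
# state = run_with_lmulator(program)  # requires a real query_lm
```

The design point this illustrates is that the interpreter, not the LM, decides when to hand off: anything executable stays deterministic, and only semantic gaps fall back to LM simulation.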
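
The experiment-setup row quotes the paper's two prompting regimes: three in-context examples drawn from the same problem family, or from different problems (cross-task). The sketch below shows one plausible way to assemble such prompts, assuming a plain Q/A exemplar format; the paper's actual templates (which include code) are richer, and `build_prompt`, `exemplars`, and the field names are illustrative assumptions.

```python
import random

def build_prompt(exemplars, question, k=3):
    """Concatenate k worked exemplars followed by the new question."""
    shots = random.sample(exemplars, k)
    blocks = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in shots]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Same-family prompting: draw exemplars from the task being evaluated.
# prompt = build_prompt(same_task_exemplars, question)
# Cross-task prompting: draw exemplars from other BBH tasks instead.
# prompt = build_prompt(other_task_exemplars, question)
```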