reproducibilityindex.ai

Language Models can Solve Computer Tasks

Authors: Geunwoo Kim, Pierre Baldi, Stephen McAleer

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the RCI approach on the Mini Wo B++ benchmark [61], and show it surpasses existing SL, RL, and LLM approaches. Furthermore, it proves itself to state-of-the-art compared to existing methods, using only a small number of demonstrations per task instead of tens of thousands, and without relying on a task-specific reward function.
Researcher Affiliation	Academia	Geunwoo Kim University of California, Irvine kgw@uci.edu Pierre Baldi University of California, Irvine pfbaldi@ics.uci.edu Stephen Mc Aleer Carnegie Mellon University smcaleer@cs.cmu.edu
Pseudocode	No	The paper describes the RCI prompting scheme and its application to computer tasks through descriptive text and illustrative figures (e.g., Figure 3), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code	Yes	Our code can be found here: https://github.com/posgnu/rci-agent.
Open Datasets	Yes	We evaluate the RCI approach on the Mini Wo B++ benchmark [61]
Dataset Splits	No	The paper evaluates its method on the Mini Wo B++ benchmark using 'a handful of demonstrations per task' for in-context learning. While it references training data for other approaches and uses a stopping condition for RCI loops on reasoning tasks, it does not explicitly provide details about train/validation/test splits for its own computer task experiments or how data was partitioned for validation purposes in a reproducible manner.
Hardware Specification	No	All models are accessed through the Open AI API between January 2023 and March 2023.
Software Dependencies	No	The paper states that 'All models are accessed through the Open AI API between January 2023 and March 2023' and lists API names such as 'gpt-3.5-turbo' and 'gpt-4'. However, it does not provide specific version numbers for ancillary software dependencies or libraries (e.g., Python, PyTorch, TensorFlow) that would be needed to replicate the experimental setup locally.
Experiment Setup	Yes	For all model usage, a maximum token length of 256 and a temperature value of 0, indicating greedy decoding, are used.