Language Models can Solve Computer Tasks

Authors: Geunwoo Kim, Pierre Baldi, Stephen McAleer

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We evaluate the RCI approach on the MiniWoB++ benchmark [61], and show it surpasses existing SL, RL, and LLM approaches. Furthermore, it proves to be state-of-the-art compared to existing methods, using only a small number of demonstrations per task instead of tens of thousands, and without relying on a task-specific reward function." |
| Researcher Affiliation | Academia | Geunwoo Kim, University of California, Irvine (kgw@uci.edu); Pierre Baldi, University of California, Irvine (pfbaldi@ics.uci.edu); Stephen McAleer, Carnegie Mellon University (smcaleer@cs.cmu.edu) |
| Pseudocode | No | The paper describes the RCI prompting scheme and its application to computer tasks through descriptive text and illustrative figures (e.g., Figure 3), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. (A hedged sketch of the RCI loop appears below the table.) |
| Open Source Code | Yes | "Our code can be found here: https://github.com/posgnu/rci-agent" |
| Open Datasets | Yes | "We evaluate the RCI approach on the MiniWoB++ benchmark [61]" |
| Dataset Splits | No | The paper evaluates its method on the MiniWoB++ benchmark using "a handful of demonstrations per task" for in-context learning. While it references the training data used by other approaches and describes a stopping condition for RCI loops on reasoning tasks, it does not specify train/validation/test splits for its own computer-task experiments or how data was partitioned for validation. |
| Hardware Specification | No | "All models are accessed through the OpenAI API between January 2023 and March 2023." |
| Software Dependencies | No | The paper states that "All models are accessed through the OpenAI API between January 2023 and March 2023" and lists model names such as "gpt-3.5-turbo" and "gpt-4", but it does not give version numbers for the ancillary software (e.g., Python, PyTorch) needed to replicate the experimental setup locally. |
| Experiment Setup | Yes | "For all model usage, a maximum token length of 256 and a temperature value of 0, indicating greedy decoding, are used." (See the API-call sketch below.) |
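Since the paper provides no formal algorithm block, here is a minimal sketch of the RCI (Recursively Criticize and Improve) loop as the report describes it. The critique/improve prompt wording, the `call_llm` helper (defined in the next sketch), and the `max_rounds` cap are illustrative assumptions, not the authors' exact prompts or code; their implementation is in the repository linked above.

```python
# Hedged sketch of an RCI (Recursively Criticize and Improve) loop.
# Assumptions: `call_llm` is a hypothetical helper (see the next sketch);
# the prompt wording approximates, but is not, the paper's exact prompts.

def rci(task_prompt: str, max_rounds: int = 3) -> str:
    """Generate an answer, then repeatedly critique and improve it."""
    answer = call_llm(task_prompt)
    for _ in range(max_rounds):
        # Ask the model to critique its own previous output.
        critique = call_llm(
            f"{task_prompt}\nAnswer: {answer}\n"
            "Review your previous answer and find problems with it."
        )
        # Ask the model to produce an improved answer given the critique.
        improved = call_llm(
            f"{task_prompt}\nAnswer: {answer}\nCritique: {critique}\n"
            "Based on the problems you found, improve your answer."
        )
        if improved.strip() == answer.strip():
            break  # stopping condition: the answer has stopped changing
        answer = improved
    return answer
```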
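The decoding configuration reported in the Experiment Setup row (temperature 0, i.e. greedy decoding, and a 256-token output cap) maps directly onto OpenAI chat-completion parameters. Below is a hedged sketch of the `call_llm` helper assumed above, written against the current `openai` Python client; the client version and the default model choice are assumptions, since the paper only names the models it accessed through the API.

```python
# Sketch of the call_llm helper assumed by the RCI loop above, using the
# decoding settings reported in the paper: temperature 0 (greedy) and a
# 256-token output cap. The openai client usage here is an assumption;
# the paper accessed gpt-3.5-turbo / gpt-4 via the OpenAI API in early 2023.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # greedy decoding, as reported in the paper
        max_tokens=256,  # maximum output length, as reported in the paper
    )
    return response.choices[0].message.content
```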