Language Models can Solve Computer Tasks
Authors: Geunwoo Kim, Pierre Baldi, Stephen McAleer
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the RCI approach on the MiniWoB++ benchmark [61], and show it surpasses existing SL, RL, and LLM approaches. Furthermore, it achieves state-of-the-art performance compared to existing methods, using only a small number of demonstrations per task instead of tens of thousands, and without relying on a task-specific reward function. (A sketch of the RCI loop appears below the table.) |
| Researcher Affiliation | Academia | Geunwoo Kim, University of California, Irvine (kgw@uci.edu); Pierre Baldi, University of California, Irvine (pfbaldi@ics.uci.edu); Stephen McAleer, Carnegie Mellon University (smcaleer@cs.cmu.edu) |
| Pseudocode | No | The paper describes the RCI prompting scheme and its application to computer tasks through descriptive text and illustrative figures (e.g., Figure 3), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code can be found here: https://github.com/posgnu/rci-agent. |
| Open Datasets | Yes | We evaluate the RCI approach on the MiniWoB++ benchmark [61]. |
| Dataset Splits | No | The paper evaluates its method on the MiniWoB++ benchmark using 'a handful of demonstrations per task' for in-context learning. While it references training data for other approaches and uses a stopping condition for RCI loops on reasoning tasks, it does not explicitly report train/validation/test splits for its own computer-task experiments or how data was partitioned for validation in a reproducible manner. |
| Hardware Specification | No | The paper reports no hardware details beyond API access: 'All models are accessed through the OpenAI API between January 2023 and March 2023.' |
| Software Dependencies | No | The paper states that 'All models are accessed through the OpenAI API between January 2023 and March 2023' and lists model names such as 'gpt-3.5-turbo' and 'gpt-4'. However, it does not provide specific version numbers for ancillary software dependencies or libraries (e.g., Python, PyTorch, TensorFlow) that would be needed to replicate the experimental setup locally. |
| Experiment Setup | Yes | For all model usage, a maximum token length of 256 and a temperature of 0 (i.e., greedy decoding) are used. (See the API configuration sketch below the table.) |
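
The decoding settings in the Experiment Setup row are the paper's main reproducible knobs. Below is a minimal sketch of one such call using the legacy OpenAI Python client that was current when the experiments ran (January–March 2023); the API key and prompt content are placeholders, not values from the paper.

```python
import openai  # legacy client (openai<1.0), current when the paper's experiments ran

openai.api_key = "YOUR_API_KEY"  # placeholder

# One call with the reported decoding settings.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # model named in the paper; "gpt-4" was also evaluated
    messages=[{"role": "user", "content": "..."}],  # task prompt goes here
    temperature=0,   # greedy decoding, as reported
    max_tokens=256,  # maximum token length, as reported
)
print(response["choices"][0]["message"]["content"])
```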
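
For readers reconstructing the method itself, the paper's RCI (Recursively Criticize and Improve) prompting alternates a critique prompt and an improvement prompt over the model's own output. The sketch below is an illustration under that reading, not the authors' implementation (which lives in the linked rci-agent repository); the prompt wordings, the `chat` helper, and the fixed `rounds` cutoff are assumptions for illustration.

```python
import openai  # same legacy client and decoding settings as in the previous sketch

def chat(messages):
    """Single model call with the paper's reported settings (temperature 0, 256 max tokens)."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
        max_tokens=256,
    )
    return resp["choices"][0]["message"]["content"]

def rci(task_prompt, rounds=1):
    """Answer, then critique and improve the answer once per round.

    The prompt wordings and the `rounds` cutoff are illustrative
    assumptions; the authors' actual prompts live in the rci-agent repo.
    """
    messages = [{"role": "user", "content": task_prompt}]
    answer = chat(messages)
    for _ in range(rounds):
        # Critique step: ask the model to find problems with its own output.
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user",
                         "content": "Review your previous answer and find problems with it."})
        critique = chat(messages)
        # Improve step: ask the model to revise based on its critique.
        messages.append({"role": "assistant", "content": critique})
        messages.append({"role": "user",
                         "content": "Based on the problems you found, improve your answer."})
        answer = chat(messages)
    return answer
```

The `rounds` cap here plays roughly the role of the stopping condition for RCI loops mentioned in the Dataset Splits row.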