Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Language Models can Solve Computer Tasks
Authors: Geunwoo Kim, Pierre Baldi, Stephen McAleer
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the RCI approach on the MiniWoB++ benchmark [61], and show it surpasses existing SL, RL, and LLM approaches. Furthermore, it proves itself to be state-of-the-art compared to existing methods, using only a small number of demonstrations per task instead of tens of thousands, and without relying on a task-specific reward function. |
| Researcher Affiliation | Academia | Geunwoo Kim University of California, Irvine EMAIL Pierre Baldi University of California, Irvine EMAIL Stephen McAleer Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper describes the RCI prompting scheme and its application to computer tasks through descriptive text and illustrative figures (e.g., Figure 3), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code can be found here: https://github.com/posgnu/rci-agent. |
| Open Datasets | Yes | We evaluate the RCI approach on the MiniWoB++ benchmark [61] |
| Dataset Splits | No | The paper evaluates its method on the MiniWoB++ benchmark using 'a handful of demonstrations per task' for in-context learning. While it references training data for other approaches and uses a stopping condition for RCI loops on reasoning tasks, it does not explicitly provide details about train/validation/test splits for its own computer task experiments or how data was partitioned for validation purposes in a reproducible manner. |
| Hardware Specification | No | All models are accessed through the OpenAI API between January 2023 and March 2023. |
| Software Dependencies | No | The paper states that 'All models are accessed through the OpenAI API between January 2023 and March 2023' and lists API names such as 'gpt-3.5-turbo' and 'gpt-4'. However, it does not provide specific version numbers for ancillary software dependencies or libraries (e.g., Python, PyTorch, TensorFlow) that would be needed to replicate the experimental setup locally. |
| Experiment Setup | Yes | For all model usage, a maximum token length of 256 and a temperature value of 0, indicating greedy decoding, are used. |
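
The Experiment Setup row equates a temperature of 0 with greedy decoding. The sketch below illustrates that relationship on a toy example; it is not the authors' code. The `DECODING_CONFIG` key names, the `pick_token` helper, and the example logits are all hypothetical, chosen only to show that at temperature 0 the highest-scoring token is always selected, while any positive temperature samples from a softmax distribution.

```python
import math
import random

# Decoding settings reported in the table above. The key names are an
# assumption, modeled on common text-generation request fields.
DECODING_CONFIG = {"max_tokens": 256, "temperature": 0.0}


def pick_token(logits, temperature, rng=None):
    """Return the index of the next token given raw logits (hypothetical helper)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from the temperature-scaled softmax.
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]  # made-up scores for a 3-token vocabulary
choice = pick_token(logits, DECODING_CONFIG["temperature"])
print(choice)  # index of the largest logit
```

Because the greedy branch involves no randomness, this toy function returns the same token on every call, which is why the setting matters for reproducibility. Whether a hosted API is fully deterministic at temperature 0 is a separate question that the sketch does not address.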