Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Authors: Longtao Zheng, Rundong Wang, Xinrun Wang, Bo An
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SYNAPSE on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, SYNAPSE achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, SYNAPSE is the first ICL method to solve the book-flight task in MiniWoB++. SYNAPSE also exhibits a 56% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web. |
| Researcher Affiliation | Collaboration | (1) Nanyang Technological University, Singapore; (2) Skywork AI, Singapore |
| Pseudocode | No | The paper describes the system's components and process flow through natural language and examples of prompts, but it does not include a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | To ensure reproducibility, all resources such as code, prompts, and agent trajectories have been made publicly available at https://ltzheng.github.io/Synapse. |
| Open Datasets | Yes | We evaluate SYNAPSE on two benchmarks: MiniWoB++ (Shi et al., 2017; Liu et al., 2018), a standard research task suite, and Mind2Web (Deng et al., 2023), a dataset across diverse domains of real-world web navigation. |
| Dataset Splits | No | The paper mentions a 'training set' and 'test sets' for the Mind2Web dataset but does not explicitly state a 'validation set' or detailed splits including a validation portion for reproducibility. |
| Hardware Specification | No | The paper mentions using specific LLMs like GPT-3.5 (via API) and Code Llama-7B, but it does not provide specific hardware details such as GPU models, CPU types, or cloud computing specifications used for running their experiments. |
| Software Dependencies | Yes | In the MiniWoB++ experiments, we query gpt-3.5-turbo-0301... For Mind2Web, the default LLM is gpt-3.5-turbo-16k-0613. We use text-embedding-ada-002 as the embedding model. |
| Experiment Setup | Yes | We configure the temperature to 0, i.e., greedy decoding. ... Specifically, we set k to 3 and 5 for the previous and current observations, respectively. ... We retrieve the top three exemplars from memory and use the most common one to retrieve its exemplars... |
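
The exemplar retrieval step quoted in the Experiment Setup row (take the top three matches from memory, then use the most common one) can be illustrated with a small sketch. This is not the authors' code: the function names, the `(embedding, exemplar_key)` memory layout, the use of cosine similarity, and the toy two-dimensional vectors are all assumptions made for illustration. The paper reports using text-embedding-ada-002 embeddings, which are replaced here by hand-written vectors so the example runs without any API access.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_exemplar_key(query_vec, memory, top_k=3):
    """Return the most common exemplar key among the top_k nearest memory entries.

    `memory` is a hypothetical list of (embedding, exemplar_key) pairs.
    Ties are broken in favor of the key that ranks highest by similarity.
    """
    ranked = sorted(memory, key=lambda entry: cosine(query_vec, entry[0]), reverse=True)
    top_keys = [key for _, key in ranked[:top_k]]
    return Counter(top_keys).most_common(1)[0][0]

# Toy memory: two entries point at a "click" exemplar set, two at a "type" set.
memory = [
    ([1.0, 0.0], "click"),
    ([0.9, 0.1], "click"),
    ([0.0, 1.0], "type"),
    ([0.7, 0.7], "type"),
]
chosen = retrieve_exemplar_key([1.0, 0.0], memory, top_k=3)
```

With the toy data above, the three nearest entries are two "click" entries and one "type" entry, so the majority vote selects "click". Voting over the top three rather than taking the single nearest neighbor makes the selection less sensitive to one misleading match.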