Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

Authors: Longtao Zheng, Rundong Wang, Xinrun Wang, Bo An

ICLR 2024

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We evaluate SYNAPSE on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, SYNAPSE achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, SYNAPSE is the first ICL method to solve the book-flight task in MiniWoB++. SYNAPSE also exhibits a 56% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web.
Researcher Affiliation | Collaboration | Nanyang Technological University, Singapore; Skywork AI, Singapore
Pseudocode | No | The paper describes the system's components and process flow through natural language and examples of prompts, but it does not include a formally structured pseudocode or algorithm block.
Open Source Code | Yes | To ensure reproducibility, all resources such as code, prompts, and agent trajectories have been made publicly available at https://ltzheng.github.io/Synapse.
Open Datasets | Yes | We evaluate SYNAPSE on two benchmarks: MiniWoB++ (Shi et al., 2017; Liu et al., 2018), a standard research task suite, and Mind2Web (Deng et al., 2023), a dataset across diverse domains of real-world web navigation.
Dataset Splits | No | The paper refers to a 'training set' and 'test sets' for the Mind2Web dataset but does not describe a validation set or give detailed split proportions needed for reproducibility.
Hardware Specification | No | The paper mentions using specific LLMs like GPT-3.5 (via API) and Code Llama-7B, but it does not provide specific hardware details such as GPU models, CPU types, or cloud computing specifications used for running their experiments.
Software Dependencies | Yes | In the MiniWoB++ experiments, we query gpt-3.5-turbo-0301... For Mind2Web, the default LLM is gpt-3.5-turbo-16k-0613. We use text-embedding-ada-002 as the embedding model.
Experiment Setup | Yes | We configure the temperature to 0, i.e., greedy decoding. ... Specifically, we set k to 3 and 5 for the previous and current observations, respectively. ... We retrieve the top three exemplars from memory and use the most common one to retrieve its exemplars... (a hedged code sketch of this retrieval setup follows the table)
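
The retrieval and decoding settings quoted above can be illustrated with a short sketch. This is a hypothetical reconstruction, not the authors' released implementation: it assumes the OpenAI Python client, and the helper names (embed, retrieve_exemplars, act) and the in-memory dictionary of task embeddings are illustrative only.

```python
# Hypothetical sketch of exemplar retrieval with text-embedding-ada-002 and
# greedy decoding (temperature 0), as described in the quoted setup.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(text: str) -> np.ndarray:
    # Embed a task description with the embedding model named in the paper.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


def retrieve_exemplars(query: str, memory: dict, k: int = 3) -> list:
    # Rank stored task descriptions (keys of `memory`, values are their
    # precomputed embeddings) by cosine similarity and return the top-k keys.
    q = embed(query)
    scores = {
        key: float(vec @ q / (np.linalg.norm(vec) * np.linalg.norm(q)))
        for key, vec in memory.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def act(prompt: str, model: str = "gpt-3.5-turbo-0301") -> str:
    # Query the LLM with temperature 0, i.e., greedy decoding.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```

Under this reading, the top-3 retrieved tasks would vote on which stored trajectory exemplars to prepend to the prompt, mirroring the "use the most common one" rule quoted in the Experiment Setup row; the actual logic should be checked against the code released at https://ltzheng.github.io/Synapse.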