reproducibilityindex.ai

Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

Authors: Longtao Zheng, Rundong Wang, Xinrun Wang, Bo An

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate SYNAPSE on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, SYNAPSE achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, SYNAPSE is the first ICL method to solve the book-flight task in MiniWoB++. SYNAPSE also exhibits a 56% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web.
Researcher Affiliation	Collaboration	(1) Nanyang Technological University, Singapore; (2) Skywork AI, Singapore
Pseudocode	No	The paper describes the system's components and process flow through natural language and examples of prompts, but it does not include a formally structured pseudocode or algorithm block.
Open Source Code	Yes	To ensure reproducibility, all resources such as code, prompts, and agent trajectories have been made publicly available at https://ltzheng.github.io/Synapse.
Open Datasets	Yes	We evaluate SYNAPSE on two benchmarks: MiniWoB++ (Shi et al., 2017; Liu et al., 2018), a standard research task suite, and Mind2Web (Deng et al., 2023), a dataset across diverse domains of real-world web navigation.
Dataset Splits	No	The paper mentions a 'training set' and 'test sets' for the Mind2Web dataset but does not explicitly state a 'validation set' or detailed splits including a validation portion for reproducibility.
Hardware Specification	No	The paper mentions using specific LLMs like GPT-3.5 (via API) and Code Llama-7B, but it does not provide specific hardware details such as GPU models, CPU types, or cloud computing specifications used for running their experiments.
Software Dependencies	Yes	In the MiniWoB++ experiments, we query gpt-3.5-turbo-0301... For Mind2Web, the default LLM is gpt-3.5-turbo-16k-0613. We use text-embedding-ada-002 as the embedding model.
Experiment Setup	Yes	We configure the temperature to 0, i.e., greedy decoding. ... Specifically, we set k to 3 and 5 for the previous and current observations, respectively. ... We retrieve the top three exemplars from memory and use the most common one to retrieve its exemplars...
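
The retrieval step quoted under Experiment Setup (take the top three exemplars from memory, then use the most common one) can be illustrated with a minimal sketch. This is not the authors' code: the CONFIG layout, the function names, the toy two-dimensional embeddings, and the in-memory list of (task name, embedding) pairs are all assumptions made for illustration. Only the model names, the temperature of 0, and the top-three setting come from the table above.

```python
import math
from collections import Counter

# Values quoted in the table above; the dictionary layout is an assumption.
CONFIG = {
    "miniwob_model": "gpt-3.5-turbo-0301",
    "mind2web_model": "gpt-3.5-turbo-16k-0613",
    "embedding_model": "text-embedding-ada-002",
    "temperature": 0,          # greedy decoding
    "top_k_exemplars": 3,
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_exemplar_task(query_vec, memory, k=CONFIG["top_k_exemplars"]):
    """Return the most common task name among the k nearest memory entries.

    memory is a hypothetical list of (task_name, embedding) pairs.
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    top_names = [name for name, _ in ranked[:k]]
    return Counter(top_names).most_common(1)[0][0]


# Toy data: real embeddings would come from the embedding model.
memory = [
    ("click-button", [1.0, 0.0]),
    ("click-button", [0.9, 0.1]),
    ("book-flight", [0.8, 0.2]),
    ("enter-text", [0.0, 1.0]),
]
chosen = select_exemplar_task([1.0, 0.0], memory)
print(chosen)
```

In this toy run two of the three nearest entries belong to the same task, so that task is selected and its stored exemplars would then be placed in the prompt.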