Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
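The validation step mentioned in the notice can be sketched as follows. This is a minimal illustration only, not the pipeline from [1]: the `binary_f1` helper and the example labels are hypothetical, and a real validation run would cover every reproducibility variable and report the full metrics described in [1].

```python
# Hedged sketch: comparing LLM-assigned labels against a manually
# labeled validation set for one reproducibility variable.
# The helper name and the label data below are hypothetical.

def binary_f1(y_true, y_pred, positive="Yes"):
    """Precision, recall, and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical manual labels vs. LLM pipeline output:
manual = ["Yes", "Yes", "No", "Yes", "No"]
llm    = ["Yes", "No",  "No", "Yes", "No"]
precision, recall, f1 = binary_f1(manual, llm)
# Here the LLM misses one positive: precision 1.0, recall 2/3, F1 0.8.
```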

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Authors: Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that models trained with synthesized trajectories significantly improve performance. Compared to traditional human-annotated data pipelines, our method is more cost-effective, highlighting the scalability and economic viability of the AgentTrek approach. Our comprehensive experiments demonstrate that agents trained with AgentTrek's synthesized data significantly outperform those trained on existing datasets across multiple benchmarks, showing marked improvements in both textual and visual web browsing capabilities.
Researcher Affiliation | Collaboration | The University of Hong Kong; Salesforce Research
Pseudocode | No | The paper describes its methods in structured prose and flowcharts (e.g., Figure 2, Figure 3, Figure 5) and includes example prompts, but it does not present any formal pseudocode blocks or algorithms.
Open Source Code | No | The paper includes the URL 'https://agenttrek.github.io', but this is a project homepage and is not explicitly stated to be a code repository. No direct statement about releasing code for the described methodology is present.
Open Datasets | Yes | We extract web interaction tutorials from the RedPajama dataset (Computer, 2023). We use WebArena (Zhou et al., 2023) as our primary benchmark. First, ScreenSpot (Cheng et al., 2024) provides a GUI visual grounding benchmark. Multimodal-Mind2Web (Deng et al., 2024; Zheng et al., 2024) extends the Mind2Web benchmark.
Dataset Splits | Yes | Using a 95:5 train-test split, we trained the FastText model, which achieved an 89.5% F1 score on the validation set. We fine-tuned the model using 10,000 trajectories from the AgentTrek dataset. For text-based agents, we fine-tune Qwen2.5 LLMs (Qwen et al., 2025) at various parameter scales (7B and 32B) using 6,000 agent trajectories from the AgentTrek dataset.
Hardware Specification | No | The paper mentions that executing 1,000 tasks with GPT-4o-08-06 incurs a cost of approximately $215, which indicates use of an API, but it does not provide specific hardware details (e.g., GPU models, CPU models, memory) used for running the experiments or training the models.
Software Dependencies | No | The paper mentions several software components and models, such as GPT-4o mini, FastText, BrowserGym, GPT-4o, Qwen2-VL, Qwen2.5 LLMs, Playwright, and pyautogui. While GPT-4o-08-06 is mentioned in the cost analysis, no specific versions are provided for the other key software components used in the methodology.
Experiment Setup | No | The paper states 'We fine-tune the model using 10,000 trajectories from the AgentTrek dataset' and 'For text-based agents, we fine-tune Qwen2.5 LLMs [...] using 6,000 agent trajectories'. However, it lacks specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings, which are crucial for reproducibility.
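The 95:5 train-test split quoted in the Dataset Splits row can be sketched as below. This is a hedged illustration, not the authors' code: the `split_95_5` helper, the fixed seed, and the placeholder documents are assumptions; the actual FastText training data and tooling are those described in the paper.

```python
# Hedged sketch of a 95:5 train-test split like the one quoted above.
# Function name, seed, and placeholder documents are hypothetical.
import random

def split_95_5(items, seed=0):
    """Shuffle deterministically and cut off the first 95% as training data."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.95)
    return shuffled[:cut], shuffled[cut:]

# Placeholder stand-ins for the tutorial documents:
docs = [f"tutorial_{i}" for i in range(1000)]
train, test = split_95_5(docs)
# 1,000 items split into 950 for training and 50 held out.
```

A fixed seed keeps the split reproducible across runs, which matters when the held-out 5% is reused to report the validation F1.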