Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Authors: Vishnu Sarukkai, Zhiqiang Xie, Kayvon Fatahalian

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our database construction methods through experiments addressing three key questions: Database scaling: How does task success rate scale with increasing database size? Improving database construction: How much do population-based training and exemplar-level curation improve task success rate? Overall effectiveness: How do our approaches compare to alternative approaches leveraging task-specific domain knowledge or hierarchical algorithms?
Researcher Affiliation | Academia | Vishnu Sarukkai (Stanford University); Zhiqiang Xie (Stanford University); Kayvon Fatahalian (Stanford University)
Pseudocode | Yes | Algorithm 1 ReAct-style Agent Loop... Algorithm 2 Database Curation Logic for +DB-Curation... Algorithm 3 Database Construction from Top Exemplars for +Exemplar-Curation... Algorithm 4 Multi-key Retrieval
Open Source Code | Yes | Code is attached in Supplemental, and will be made publicly available. Benchmarks are publicly available.
Open Datasets | Yes | We evaluate our methods on three benchmarks: ALFWorld [37], a text-based environment for navigation and object manipulation; InterCode-SQL [38], an interactive coding environment for SQL query generation; and Wordcraft [39], a simplified adaptation of Little Alchemy requiring compositional reasoning to combine elements.
Dataset Splits | Yes | ALFWorld ... 3500 training tasks and 134 out-of-distribution test tasks... InterCode-SQL ... Of the 1034 tasks in the dataset, we randomly assign 800 tasks to train and the remaining 234 tasks to test... Wordcraft ... We randomly select 4000 training tasks and 500 test tasks from the subset of tasks requiring up to 2 steps to solve, with the train-test split separating the tasks into disjoint sets of goal elements.
Hardware Specification | Yes | All experiments were conducted using the following computational resources: 1 NVIDIA A5000 GPU (24GB memory) for embedding computation; 64GB RAM
Software Dependencies | No | For embedding computations, we used all-MiniLM-L6-v2 [43]. For LLM inference, we used the OpenAI API for GPT-4o-mini... The retrieval mechanism is implemented using FAISS [44] for efficient similarity search as the database grows.
Experiment Setup | Yes | Unless otherwise specified, we use GPT-4o-mini as our base LLM (temperature 0.1). For Fixed-DB and all Traj-Bootstrap agents, we retrieve the top-k most similar trajectories at each decision step (k = 6 for ALFWorld and InterCode-SQL, 10 for Wordcraft). We initialize each database with a small human-provided example set (18 for ALFWorld, 10 for InterCode-SQL, 4 for Wordcraft). With +DB-Curation, we maintain N = 5 database instances with curation every time the database size is doubled, starting with a minimum size of ten trajectories. We report success rates averaged over five random seeds.
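
The Experiment Setup row describes retrieving the top-k most similar trajectories at each decision step. The following is a minimal sketch of that kind of retrieval, assuming cosine similarity over precomputed embedding vectors. It uses plain Python rather than FAISS or a sentence-embedding model, and the function names, trajectory IDs, and toy 2-D vectors are hypothetical illustrations, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return up to k (trajectory_id, similarity) pairs, most similar first."""
    scored = [(traj_id, cosine(query, emb)) for traj_id, emb in database]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D embeddings standing in for real sentence embeddings (hypothetical data).
toy_db = [
    ("traj_a", [1.0, 0.0]),
    ("traj_b", [0.0, 1.0]),
    ("traj_c", [1.0, 1.0]),
]

# Assumed usage: embed the current task or state, then fetch the k nearest trajectories.
top = retrieve_top_k([1.0, 0.1], toy_db, k=2)
top_ids = [traj_id for traj_id, _ in top]
```

A brute-force scan like this is linear in the database size; an approximate or indexed search library becomes worthwhile as the number of stored trajectories grows, which is the role the page attributes to FAISS.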