Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Generalizing Experience for Language Agents with Hierarchical MetaFlows
Authors: Shengda Fan, Xin Cong, Zhong Zhang, Yuepeng Fu, Yesai Wu, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on AppWorld and WorkBench demonstrate that integrating with MetaFlowLLM, existing agents (e.g., ReAct, Reflexion) can gain substantial performance improvement with reducing execution costs. |
| Researcher Affiliation | Academia | ¹Gaoling School of Artificial Intelligence, Renmin University of China; ²Department of Statistics and Data Science, Tsinghua University; ³Department of Computer Science and Technology, Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Hierarchical MetaFlow Merging |
| Open Source Code | Yes | The code is available at https://github.com/RUCBM/MetaFlowLLM. |
| Open Datasets | Yes | We conduct experiments on two representative agent datasets: AppWorld [31] and WorkBench [32]. |
| Dataset Splits | Yes | The dataset statistics are summarized in Table 5 (WorkBench / AppWorld): Offline Data Size 237 / 90; Test Data Size 353 / 57; SFT Data Size 1,102 / 1,092; RL Data Size 121 / 247. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A800 40G GPUs. |
| Software Dependencies | Yes | We use the all-MiniLM-L6-v2 model to encode the task, and the cosine similarity of the task embeddings is utilized as the distance metric. In the SFT stage, we fine-tune the MetaFlowGen on Qwen2.5-7B-Instruct for 3 epochs using the AdamW optimizer [23] and a linear learning rate scheduler with a peak learning rate of 2 × 10⁻⁵. Each mini-batch contains 32 examples, and the maximum sequence length is set as 8,192 tokens. In RL stage, we adopt TRL [33] as our training framework. |
| Experiment Setup | Yes | We set τ to 1.0 in all experiments to ensure the quality of the experience tree. The value of λ is set to 0.7. In the SFT stage, we fine-tune the MetaFlowGen on Qwen2.5-7B-Instruct for 3 epochs using the AdamW optimizer [23] and a linear learning rate scheduler with a peak learning rate of 2 × 10⁻⁵. Each mini-batch contains 32 examples, and the maximum sequence length is set as 8,192 tokens. In RL stage, we adopt TRL [33] as our training framework. We set the training epochs to 2, batch size to 28, learning rate to 1 × 10⁻⁶, KL coefficient to 0 [22], rollout number to 14. |
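
For readers who want the reported settings in one place, the values quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the class names `SftSettings`, `RlSettings`, and `MergeSettings` and the helper `linear_decay_lr` are hypothetical, and the schedule assumes a plain linear decay from the peak learning rate to zero with no warmup, which the excerpt does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeSettings:
    """tau and lambda as reported; their roles are defined in the paper."""
    tau: float = 1.0
    lam: float = 0.7


@dataclass(frozen=True)
class SftSettings:
    """SFT-stage values as quoted in the Experiment Setup row."""
    base_model: str = "Qwen2.5-7B-Instruct"
    epochs: int = 3
    optimizer: str = "AdamW"
    peak_learning_rate: float = 2e-5
    batch_size: int = 32
    max_seq_length: int = 8192


@dataclass(frozen=True)
class RlSettings:
    """RL-stage values as quoted in the Experiment Setup row."""
    epochs: int = 2
    batch_size: int = 28
    learning_rate: float = 1e-6
    kl_coefficient: float = 0.0
    rollout_number: int = 14


def linear_decay_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Assumed schedule: linear decay from peak_lr to 0, no warmup."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    return peak_lr * (1 - step / total_steps)


if __name__ == "__main__":
    sft = SftSettings()
    # total_steps=100 is a placeholder, not a value from the paper.
    for step in (0, 50, 100):
        print(step, linear_decay_lr(step, 100, sft.peak_learning_rate))
```

The actual number of optimizer steps and any warmup phase would need to be taken from the released code rather than from this sketch.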