Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Generalizing Experience for Language Agents with Hierarchical MetaFlows

Authors: Shengda Fan, Xin Cong, Zhong Zhang, Yuepeng Fu, Yesai Wu, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on AppWorld and WorkBench demonstrate that integrating with MetaFlowLLM, existing agents (e.g., ReAct, Reflexion) can gain substantial performance improvement with reducing execution costs.
Researcher Affiliation | Academia | (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Department of Statistics and Data Science, Tsinghua University; (3) Department of Computer Science and Technology, Tsinghua University
Pseudocode | Yes | Algorithm 1: Hierarchical MetaFlow Merging
Open Source Code | Yes | The code is available at https://github.com/RUCBM/MetaFlowLLM.
Open Datasets | Yes | We conduct experiments on two representative agent datasets: AppWorld [31] and WorkBench [32].
Dataset Splits | Yes | The dataset statistics are summarized in Table 5. Values are given as WorkBench / AppWorld: Offline Data Size 237 / 90; Test Data Size 353 / 57; SFT Data Size 1,102 / 1,092; RL Data Size 121 / 247.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A800 40G GPUs.
Software Dependencies | Yes | We use the all-MiniLM-L6-v2 model to encode the task, and the cosine similarity of the task embeddings is utilized as the distance metric. In the SFT stage, we fine-tune the MetaFlowGen on Qwen2.5-7B-Instruct for 3 epochs using the AdamW optimizer [23] and a linear learning rate scheduler with a peak learning rate of 2 × 10^-5. Each mini-batch contains 32 examples, and the maximum sequence length is set as 8,192 tokens. In RL stage, we adopt TRL [33] as our training framework.
Experiment Setup | Yes | We set τ to 1.0 in all experiments to ensure the quality of the experience tree. The value of λ is set to 0.7. In the SFT stage, we fine-tune the MetaFlowGen on Qwen2.5-7B-Instruct for 3 epochs using the AdamW optimizer [23] and a linear learning rate scheduler with a peak learning rate of 2 × 10^-5. Each mini-batch contains 32 examples, and the maximum sequence length is set as 8,192 tokens. In RL stage, we adopt TRL [33] as our training framework. We set the training epochs to 2, batch size to 28, learning rate to 1 × 10^-6, KL coefficient to 0 [22], rollout number to 14.
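
The two numeric ingredients quoted in the last rows, cosine similarity between task embeddings and a linear learning rate schedule with a peak of 2 × 10^-5, can be illustrated with a minimal sketch. This is not the authors' code. The function names, the warmup_steps parameter, and the step-count formula are assumptions made for illustration; the quoted text does not report a warmup length or gradient accumulation. In practice the vectors would be embeddings from all-MiniLM-L6-v2; plain Python lists stand in for them here.

```python
import math

PEAK_LR = 2e-5   # SFT peak learning rate quoted above
BATCH_SIZE = 32  # SFT mini-batch size quoted above
EPOCHS = 3       # SFT epochs quoted above


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def linear_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=0):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps.

    warmup_steps is a hypothetical parameter; the source does not report one.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


def total_sft_steps(num_examples, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Optimizer steps, assuming no gradient accumulation and that the
    final partial batch of each epoch is kept."""
    return math.ceil(num_examples / batch_size) * epochs
```

Under these assumptions, total_sft_steps(1102) for the WorkBench SFT set gives 105 optimizer steps, and linear_lr(step, 105) falls from the peak rate at step 0 to zero at step 105.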