Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Generalizing Experience for Language Agents with Hierarchical MetaFlows

Authors: Shengda Fan, Xin Cong, Zhong Zhang, Yuepeng Fu, Yesai Wu, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimental results on AppWorld and WorkBench demonstrate that integrating with MetaFlowLLM, existing agents (e.g., ReAct, Reflexion) can gain substantial performance improvement with reducing execution costs.
Researcher Affiliation	Academia	1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Department of Statistics and Data Science, Tsinghua University; 3 Department of Computer Science and Technology, Tsinghua University
Pseudocode	Yes	Algorithm 1: Hierarchical MetaFlow Merging
Open Source Code	Yes	The code is available at https://github.com/RUCBM/MetaFlowLLM.
Open Datasets	Yes	We conduct experiments on two representative agent datasets: AppWorld [31] and WorkBench [32].
Dataset Splits	Yes	The dataset statistics are summarized in Table 5. Metric (WorkBench / AppWorld): Offline Data Size 237 / 90; Test Data Size 353 / 57; SFT Data Size 1,102 / 1,092; RL Data Size 121 / 247.
Hardware Specification	Yes	All experiments are conducted on 8 NVIDIA A800 40G GPUs.
Software Dependencies	Yes	We use the all-MiniLM-L6-v2 model to encode the task, and the cosine similarity of the task embeddings is utilized as the distance metric. In the SFT stage, we fine-tune the MetaFlowGen on Qwen2.5-7B-Instruct for 3 epochs using the AdamW optimizer [23] and a linear learning rate scheduler with a peak learning rate of 2 × 10⁻⁵. Each mini-batch contains 32 examples, and the maximum sequence length is set as 8,192 tokens. In RL stage, we adopt TRL [33] as our training framework.
Experiment Setup	Yes	We set τ to 1.0 in all experiments to ensure the quality of the experience tree. The value of λ is set to 0.7. In the SFT stage, we fine-tune the MetaFlowGen on Qwen2.5-7B-Instruct for 3 epochs using the AdamW optimizer [23] and a linear learning rate scheduler with a peak learning rate of 2 × 10⁻⁵. Each mini-batch contains 32 examples, and the maximum sequence length is set as 8,192 tokens. In RL stage, we adopt TRL [33] as our training framework. We set the training epochs to 2, batch size to 28, learning rate to 1 × 10⁻⁶, KL coefficient to 0 [22], rollout number to 14.
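
For readers who want the reported settings in one place, the sketch below gathers the values quoted in the Software Dependencies and Experiment Setup rows into plain Python dataclasses and adds a small cosine-similarity helper of the kind the task-embedding comparison would need. It is an illustration only, not the authors' code: the class and field names (`TreeSettings`, `SFTSettings`, `RLSettings`) are hypothetical, the learning rates assume the readings 2 × 10⁻⁵ and 1 × 10⁻⁶, and the helper operates on plain lists rather than on real all-MiniLM-L6-v2 embeddings.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSettings:
    """Experience-tree settings; the exact role of each symbol is defined in the paper."""
    tau: float = 1.0   # tau, reported as 1.0 in all experiments
    lam: float = 0.7   # lambda, reported as 0.7


@dataclass(frozen=True)
class SFTSettings:
    """SFT-stage values as quoted above (field names are hypothetical)."""
    base_model: str = "Qwen2.5-7B-Instruct"
    epochs: int = 3
    optimizer: str = "AdamW"
    lr_scheduler: str = "linear"
    peak_learning_rate: float = 2e-5   # assumed reading of the reported value
    batch_size: int = 32
    max_seq_length: int = 8192


@dataclass(frozen=True)
class RLSettings:
    """RL-stage values as quoted above (field names are hypothetical)."""
    epochs: int = 2
    batch_size: int = 28
    learning_rate: float = 1e-6        # assumed reading of the reported value
    kl_coefficient: float = 0.0
    rollouts: int = 14


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

In practice the two vectors would be task embeddings produced by the encoder; how the similarity is turned into a merge decision is governed by the paper's Algorithm 1 and is not reproduced here.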