Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

A-Mem: Agentic Memory for LLM Agents

Authors: Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. Code for Benchmark Evaluation: https://github.com/WujiangXu/AgenticMemory
Researcher Affiliation | Collaboration | Wujiang Xu¹, Zujie Liang², Kai Mei¹, Hang Gao¹, Juntao Tan¹, Yongfeng Zhang¹,³ ¹Rutgers University ²Independent Researcher ³AIOS Foundation EMAIL
Pseudocode | No | The paper describes its methodology using textual descriptions and mathematical equations (1-10) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code for Benchmark Evaluation: https://github.com/WujiangXu/AgenticMemory Code for Production-ready Agentic Memory: https://github.com/WujiangXu/A-mem-sys
Open Datasets | Yes | To evaluate the effectiveness of instruction-aware recommendation in long-term conversations, we utilize the LoCoMo dataset [22], which contains significantly longer dialogues compared to existing conversational datasets [36, 13]. [...] Besides, we use a new dataset, named DialSim [16], to evaluate the effectiveness of our memory system.
Dataset Splits | No | The paper describes the characteristics and contents of the LoCoMo and DialSim datasets, including question types and total question-answer pairs, but does not specify how these datasets were split into training, validation, or test sets for the experiments.
Hardware Specification | No | Processing times average 5.4 seconds using GPT-4o-mini and only 1.1 seconds with locally-hosted Llama 3.2 1B on a single GPU.
Software Dependencies | No | The deployment of Qwen-1.5B/3B and Llama 3.2 1B/3B models is accomplished through local instantiation using Ollama¹, with LiteLLM² managing structured output generation. For GPT models, we utilize the official structured output API. ... For text embedding, we implement the all-minilm-l6-v2 model across all experiments.
Experiment Setup | Yes | For all baselines and our proposed method, we maintain consistency by employing identical system prompts as detailed in Appendix B. ... In our memory retrieval process, we primarily employ k=10 for top-k memory selection to maintain computational efficiency, while adjusting this parameter for specific categories to optimize performance. The detailed configurations of k can be found in Appendix A.5.
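
The Experiment Setup entry mentions top-k memory selection with k=10 over text embeddings. As a rough illustration of what such a retrieval step can look like, here is a minimal sketch that ranks stored memory vectors by cosine similarity to a query vector. It is not taken from the paper or its repositories: the function names `cosine_similarity` and `top_k_memories`, the choice of cosine similarity as the scoring function, and the toy vectors are all assumptions made for this example.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_memories(
    query: Sequence[float],
    memory_vecs: Sequence[Sequence[float]],
    k: int = 10,  # default mirrors the k=10 quoted above; hypothetical helper
) -> List[Tuple[int, float]]:
    """Return up to k (index, score) pairs, highest similarity first."""
    scored = (
        (i, cosine_similarity(query, vec)) for i, vec in enumerate(memory_vecs)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-D vectors standing in for real embeddings (assumption for illustration).
memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
result = top_k_memories([1.0, 0.0], memories, k=2)
```

In a real pipeline the vectors would come from an embedding model such as the all-minilm-l6-v2 model named in the Software Dependencies entry, and k would be tuned per question category as the quoted passage describes.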