Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Generative Caching for Structurally Similar Prompts and Responses

Authors: Sarthak Chakraborty, Suman Nath, Xuchao Zhang, Chetan Bansal, Indranil Gupta

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We compare our method with synthetic and benchmark datasets, and report our results by integrating it with two agent frameworks that perform repetitive tasks. Our evaluations on the Webshop dataset [53] demonstrate over 83% cache hit rate and at least 35% cost savings. When integrated with existing AI agents, it reduces the end-to-end execution latency and achieves 20% higher hit rate. Experiments: Setup: We implement GenCache as a library exposing an API interface through which clients issue input prompts. We evaluate GenCache in two scenarios: (1) standalone user prompts (§4.1, §4.2) using synthetic data, and (2) integrate with two AI agents (§4.3).
Researcher Affiliation	Collaboration	Sarthak Chakraborty¹, Suman Nath², Xuchao Zhang², Chetan Bansal², Indranil Gupta¹; ¹University of Illinois at Urbana-Champaign, ²Microsoft Research
Pseudocode	Yes	Figure 2b shows the prompt for CodeGenLLM and the generated program. import re import sys def func(prompt): try: user_instr = re.search(...) [...] print(f"The tenant name is {tenant_name}") except: print("None: Invalid prompt") if __name__ == "__main__": if len(sys.argv) != 2: print("None: Invalid args") else: prompt = sys.argv[1] tenant_name = func(prompt)
Open Source Code	Yes	Code link for GenCache is available at https://github.com/sarthak-chakraborty/GenCache
Open Datasets	Yes	For experiments with Laser [8], we use the WebShop dataset [53], a simulated e-commerce environment featuring 12000+ crowd-sourced user instructions. [53] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 20744–20757. Curran Associates, Inc., 2022.
Dataset Splits	No	The paper mentions using 10,000 synthetic input prompts and 5,000 input prompts for each dataset type for evaluation, and also states that the first ν (default 4) input prompts are used to populate clusters. However, it does not specify explicit training, testing, or validation splits for the datasets themselves in a traditional sense, nor does it provide citations to predefined splits.
Hardware Specification	Yes	We run our experiments on cloud servers with 8-core Intel Xeon CPU and 64 GB memory.
Software Dependencies	No	The paper mentions using GPT-4o, GPT-4, Sentence Transformer, FAISS, LangChain, and OpenAI Chat Completion. However, it does not provide specific version numbers for these software components or the programming language (e.g., Python version) used to implement the system. While the supplementary material states that specific versions will be included in a requirements file, this information is not present in the main paper text itself.
Experiment Setup	Yes	We use GPT-4o for CodeGenLLM and ValidLLM, with default parameters ρ = 30, γ = 50%, T_p = 0.8, and T_r = 0.75, unless stated otherwise. Each experiment starts with an empty Cache Store and Cluster Database. The first ν input prompts (default ν = 4) are therefore used to populate the clusters before caching becomes possible.
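
The program quoted in the Pseudocode row is flattened onto one line and elides its regular expression. For readability, here is a minimal runnable sketch of a program with the same shape, written in Python like the quoted excerpt. The regex pattern, the assumed prompt wording, and the return value of func are assumptions made for illustration only; none of them comes from the paper, and this is not the program GenCache actually generates.

```python
import re
import sys


def func(prompt):
    """Extract a tenant name from a prompt and print the response.

    The pattern below is hypothetical: the quoted excerpt elides the
    real regular expression.
    """
    # Assumed prompt shape: "... tenant <name> ..." (illustrative only).
    match = re.search(r"tenant\s+(\w+)", prompt)
    if match is None:
        # Mirrors the excerpt's fallback message for unparseable prompts.
        print("None: Invalid prompt")
        return None
    tenant_name = match.group(1)
    print(f"The tenant name is {tenant_name}")
    return tenant_name


if __name__ == "__main__":
    # Like the excerpt, expect exactly one command-line argument: the prompt.
    if len(sys.argv) != 2:
        print("None: Invalid args")
    else:
        func(sys.argv[1])
```

Unlike the excerpt, which wraps the body in a bare except, this sketch checks the match result explicitly so that unrelated errors are not silently reported as an invalid prompt.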