Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Fast Inference for Augmented Large Language Models

Authors: Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement MARS on top of vLLM and evaluate its performance against baseline LLM inference systems, demonstrating improvements in end-to-end latency by 27%-85% and reductions in TTFT by 4%-96% compared to the existing augmented-LLM system, with even greater gains over vLLM. Our primary evaluation metrics are end-to-end latency (the time from when a request is submitted to the system until its completion) and time-to-first-token (TTFT) (both mean and P99) across three datasets and four model sizes, comparing MARS's performance against INFERCEPT (Abhyankar et al., 2024) and vanilla vLLM (Kwon et al., 2023).
Researcher Affiliation | Collaboration | Rana Shahout (Harvard University); Cong Liang (Tsinghua University); Shiji Xin (Harvard University); Qianru Lao (Harvard University); Yong Cui (Tsinghua University); Minlan Yu (Harvard University); Michael Mitzenmacher (Harvard University)
Pseudocode | Yes | The pseudocode of the MARS scheduler is provided in Algorithm 1 in Appendix B.
Open Source Code | Yes | Our implementation is available online (code, [n. d.]). In the references: MARS code. [n. d.]. MARS implementation. https://github.com/mars-repository/mars-codebase.
Open Datasets | Yes | We evaluate our system using three datasets. The first two, based on INFERCEPT. The single-API dataset is a subset containing only a single API, while the multi-API dataset is the full INFERCEPT dataset. The third dataset, ToolBench (Qin et al., 2023), is an instruction-tuning dataset for tool-use tasks, featuring over 16,000 real-world APIs across 49 categories.
Dataset Splits | Yes | We train the model using an 80-20 split for training and validation, classifying output lengths into bins. We apply this model specifically to the ToolBench dataset... MARS is evaluated using the test portion of the ToolBench data to ensure accuracy.
Hardware Specification | Yes | Testbed. We used a machine with dual AMD EPYC 7313 CPUs (16 cores each, 64 threads total), 503 GB RAM, and two NVIDIA A100 GPUs (80 GB each) connected via NVLink.
Software Dependencies | No | The paper mentions MARS is implemented on top of 'vLLM (Kwon et al., 2023)' and uses the 'OPT-125M model (Zhang et al., 2022)' for predictions, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | GPU memory usage was capped at 40 GB to match INFERCEPT's setup. Llama 70B was served using vLLM's default tensor parallelism (set to 2) across the two GPUs. Parameter experiments led us to set the predefined threshold at 100 (testing with the datasets in Section 4). Our primary evaluation metrics are end-to-end latency (...) and time-to-first-token (TTFT) (...) across three datasets and four model sizes, comparing MARS's performance (...) at a request rate of 5.
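
The evaluation metrics quoted in the table (end-to-end latency and TTFT, each reported as mean and P99) could be computed from per-request timestamps roughly as follows. This is a minimal sketch, not the paper's measurement code: the `RequestTrace` record, its field names, and the nearest-rank percentile rule are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request timestamps, in seconds (names are assumptions)."""
    submitted: float    # when the request was submitted to the system
    first_token: float  # when the first output token was produced
    completed: float    # when the request finished


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; one of several common definitions."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(traces):
    """Return mean and P99 of end-to-end latency and time-to-first-token."""
    e2e = [t.completed - t.submitted for t in traces]
    ttft = [t.first_token - t.submitted for t in traces]
    return {
        "e2e_mean": sum(e2e) / len(e2e),
        "e2e_p99": percentile(e2e, 99),
        "ttft_mean": sum(ttft) / len(ttft),
        "ttft_p99": percentile(ttft, 99),
    }
```

Calling `summarize` on a list of `RequestTrace` records returns the four summary numbers. Other percentile conventions (for example, linear interpolation) may give slightly different P99 values on small samples.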