Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multi-agent Markov Entanglement

Authors: Shuze Chen, Tianyi Peng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we empirically study the value decomposition for the index policy on a circulant RMAB benchmark [3, 52, 10, 18] that has 4 different states each local agent. As a result, the global state space scales as large as $4^{1800} > 10^{1000}$ for $N = 1800$ agents. The specific transitions and rewards are introduced in Appendix K. For each RMAB instance, we sample a trajectory of length $T = 5N$ and use the collected data to i) solve Eq. (9) to estimate the measure of Markov entanglement; ii) train local Q-value decomposition. It quickly follows from the results in Figure 1: The estimated Markov entanglement decays as $O(1/\sqrt{N})$ in the left panel, consistent with theoretical predictions. This also implies a low decomposition error scaling of $O(\sqrt{N})$, as seen in the right panel. Furthermore, the simulated trajectory has a length of $T = 5N$ while the global state space has size $\lvert S \rvert^N$, making both entanglement estimation and local Q-value decomposition sample-efficient.
Researcher Affiliation | Academia | Shuze Chen, Graduate School of Business, Columbia University, New York, NY 10027, EMAIL; Tianyi Peng, Graduate School of Business, Columbia University, New York, NY 10027, EMAIL
Pseudocode | Yes | Meta Algorithm 1: Learning Local Q-value Functions. Require: Global policy $\pi$; horizon length $T$. 1: Execute $\pi$ for $T$ epochs and obtain $D = \{(s^t_{AB}, a^t_{AB}, r^t_{AB}, s^{t+1}_{AB}, a^{t+1}_{AB})\}_{t=1}^{T-1}$. 2: Each agent $i \in \{A, B\}$ fits $Q^\pi_i$ using local observations $D_i = \{(s^t_i, a^t_i, r^t_i, s^{t+1}_i, a^{t+1}_i)\}_{t=1}^{T-1}$.
Open Source Code | Yes | We upload the codes and instructions to recover the results.
Open Datasets | Yes | Our simulations build on a circulant RMAB benchmark, which is widely used in the literature [3, 52, 10, 18].
Dataset Splits | Yes | For each RMAB instance, we simulate a trajectory of length $T = 6N$ and collect data for the later $5N$ epochs. ... During testing, we estimate the $\mu$-weighted decomposition error using 50 simulations sampled from the stationary distribution.
Hardware Specification | Yes | Finally, we use standard convex optimization solvers to compute Markov entanglement, which can be run efficiently on a single CPU.
Software Dependencies | No | The paper mentions 'standard convex optimization solvers' but does not specify any particular software names or version numbers.
Experiment Setup | Yes | We set the discount factor to $\gamma = 0.5$ and require $N/5$ arms to be pulled per period. Initially, there are $N/6$ arms in state 0, $N/3$ arms in state 1 and $N/2$ arms in state 2, the same as [52]. We then test an index policy with priority: state 2 > state 1 > state 0 > state 3.
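
The experiment setup and Meta Algorithm 1 quoted above can be illustrated with a minimal Python sketch. It is not the authors' released code. The function names (`initial_states`, `index_policy`, `fit_local_q`) are hypothetical, ties between arms in the same state are broken by arm index (an assumption), and the local fit uses a tabular SARSA-style TD update, which is only one possible way to "fit" a local Q-value function, since the quoted pseudocode does not name a method. The benchmark's transitions and rewards (Appendix K of the paper) are not reproduced here.

```python
import numpy as np

# Priority order quoted in the setup: state 2 > state 1 > state 0 > state 3.
PRIORITY = [2, 1, 0, 3]


def initial_states(n):
    """N/6 arms in state 0, N/3 in state 1, N/2 in state 2 (assumes n divisible by 6)."""
    return np.repeat([0, 1, 2], [n // 6, n // 3, n // 2])


def index_policy(states, budget, priority=PRIORITY):
    """Return a 0/1 action vector that pulls `budget` arms in priority order."""
    rank = np.empty(len(priority), dtype=int)
    rank[priority] = np.arange(len(priority))  # rank[state] = position in the priority list
    # Stable sort: arms in the same state are taken by arm index (an assumption).
    order = np.argsort(rank[states], kind="stable")
    actions = np.zeros(len(states), dtype=int)
    actions[order[:budget]] = 1
    return actions


def fit_local_q(local_data, n_states=4, n_actions=2, gamma=0.5, lr=0.1, sweeps=50):
    """Tabular TD fit of a local Q-table from (s, a, r, s_next, a_next) tuples.

    The update rule is an illustrative choice, not taken from the paper.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, a_next in local_data:
            q[s, a] += lr * (r + gamma * q[s_next, a_next] - q[s, a])
    return q


if __name__ == "__main__":
    n = 30
    states = initial_states(n)
    actions = index_policy(states, budget=n // 5)
    print("arms pulled:", int(actions.sum()))
    print("states of pulled arms:", sorted(set(states[actions == 1].tolist())))
```

In this sketch each agent only ever sees its own tuples, which mirrors step 2 of Meta Algorithm 1: the global trajectory is generated once under the index policy, and every arm then fits its Q-table from its local slice of that trajectory.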