Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Multi-agent Markov Entanglement

Authors: Shuze Chen, Tianyi Peng

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we empirically study the value decomposition for the index policy on a circulant RMAB benchmark [3, 52, 10, 18] that has 4 different states each local agent. As a result, the global state space scales as large as $4^{1800} > 10^{1000}$ for N = 1800 agents. The specific transitions and rewards are introduced in Appendix K. For each RMAB instance, we sample a trajectory of length T = 5N and use the collected data to i) solve Eq. (9) to estimate the measure of Markov entanglement; ii) train local Q-value decomposition. It quickly follows from the results in Figure 1: The estimated Markov entanglement decays as O( N) in the left panel, consistent with theoretical predictions. This also implies a low decomposition error scaling of O( N), as seen in the right panel. Furthermore, the simulated trajectory has a length of T = 5N while the global state space has size $\lvert S \rvert^N$, making both entanglement estimation and local Q-value decomposition sample-efficient.
Researcher Affiliation	Academia	Shuze Chen Graduate School of Business Columbia University New York, NY 10027 EMAIL Tianyi Peng Graduate School of Business Columbia University New York, NY 10027 EMAIL
Pseudocode	Yes	Meta Algorithm 1: Learning Local Q-value Functions. Require: Global policy $\pi$; horizon length $T$. 1: Execute $\pi$ for $T$ epochs and obtain $D = \{(s^t_{AB}, a^t_{AB}, r^t_{AB}, s^{t+1}_{AB}, a^{t+1}_{AB})\}_{t=1}^{T-1}$. 2: Each agent $i \in \{A, B\}$ fits $Q^\pi_i$ using local observations $D_i = \{(s^t_i, a^t_i, r^t_i, s^{t+1}_i, a^{t+1}_i)\}_{t=1}^{T-1}$.
Open Source Code	Yes	We upload the codes and instructions to recover the results.
Open Datasets	Yes	Our simulations build on a circulant RMAB benchmark, which is widely used in the literature [3, 52, 10, 18].
Dataset Splits	Yes	For each RMAB instance, we simulate a trajectory of length T = 6N and collect data for the later 5N epochs. ... During testing, we estimate the µ-weighted decomposition error using 50 simulations sampled from the stationary distribution.
Hardware Specification	Yes	Finally, we use standard convex optimization solvers to compute Markov entanglement, which can be run efficiently on a single CPU.
Software Dependencies	No	The paper mentions 'standard convex optimization solvers' but does not specify any particular software names or version numbers.
Experiment Setup	Yes	We set the discount factor to γ = 0.5 and require N/5 arms to be pulled per period. Initially, there are N/6 arms in state 0, N/3 arms in state 1 and N/2 arms in state 2, the same as [52]. We then test an index policy with priority: state 2 > state 1 > state 0 > state 3.
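
The Experiment Setup row can be illustrated with a short sketch. This is not the authors' code: it only encodes the quoted initial state counts (N/6, N/3, N/2 arms in states 0, 1, 2) and the quoted priority order (state 2 > state 1 > state 0 > state 3) with N/5 arms pulled per period. The function names `initial_states` and `index_policy`, the tie-breaking by arm index, and the choice N = 30 are assumptions made for illustration; the transitions and rewards from Appendix K are not reproduced here.

```python
from collections import Counter

# Priority order quoted in the Experiment Setup row: state 2 > 1 > 0 > 3.
PRIORITY = (2, 1, 0, 3)


def initial_states(n_arms):
    """Initial configuration from the setup row: N/6 arms in state 0,
    N/3 in state 1, N/2 in state 2. Requires n_arms divisible by 6."""
    if n_arms % 6 != 0:
        raise ValueError("n_arms must be divisible by 6 for exact fractions")
    return [0] * (n_arms // 6) + [1] * (n_arms // 3) + [2] * (n_arms // 2)


def index_policy(states, budget, priority=PRIORITY):
    """Return a 0/1 action per arm, pulling exactly `budget` arms.

    Arms are ranked by the position of their state in `priority`.
    Ties are broken by arm index (an assumption, not from the paper).
    """
    rank = {state: pos for pos, state in enumerate(priority)}
    order = sorted(range(len(states)), key=lambda i: (rank[states[i]], i))
    pulled = set(order[:budget])
    return [1 if i in pulled else 0 for i in range(len(states))]


# Illustrative size only; the paper reports experiments up to N = 1800.
n_arms = 30
states = initial_states(n_arms)
actions = index_policy(states, budget=n_arms // 5)

# Count which states the pulled arms are in.
pulled_states = Counter(s for s, a in zip(states, actions) if a == 1)
print(pulled_states)
```

With this initial configuration, half of the arms start in the top-priority state 2, so the first period's budget of N/5 pulls is spent entirely on state-2 arms.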