Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Linear Mixture Distributionally Robust Markov Decision Processes

Authors: Zhishuai Liu, Pan Xu

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We proposed a meta-algorithm for robust policy learning in linear mixture DRMDPs with general f-divergence defined uncertainty sets, and analyzed its sample complexity under three divergence instantiations: total variation, Kullback-Leibler, and $\chi^2$ divergences. These results establish the statistical learnability of linear mixture DRMDPs, laying the theoretical foundation for future research on this new setting. On the practical side, the proposed meta-algorithm in Algorithm 1 depends on the planning oracle defined in (4.3), rendering it computationally intractable. To demonstrate the practical utility of the linear mixture DRMDP framework, we introduced two computationally tractable algorithms in Appendix E, based on an iterative estimation subroutine, and evaluated them on simple simulated environments in Appendix F. Experimental results showed that the learned policies were robust to environment perturbations, validating the effectiveness of the proposed framework.
Researcher Affiliation Academia Zhishuai Liu, Department of Biostatistics & Bioinformatics, Duke University, Durham, NC 27708; Pan Xu, Department of Biostatistics & Bioinformatics, Duke University, Durham, NC 27708
Pseudocode Yes
Algorithm 1: Meta Algorithm of Policy Optimization for Linear Mixture DRMDP
1: Input: the offline dataset $\mathcal{D}$, the regularizer $\lambda$, and the robust level $\rho$.
2: Construct the confidence region $\widehat{\mathcal{P}}$ according to (4.2).
3: Get the estimated optimal robust policy $\hat{\pi}$ by (4.3).
4: Return: policy $\hat{\pi}$.
Algorithm 2: Distributionally Robust Transition (Value) Targeted Regression (DRTTR and DRVTR)
Require: regularization parameter $\lambda$, offline dataset $\mathcal{D}$, robust level $\rho$, initialization $\widehat{V}_{H+1}(\cdot) = 0$.
1: for $h = H, \ldots, 1$ do
2: For DRTTR, estimate $\hat{\theta}_h$ by (4.1); for DRVTR, estimate $\hat{\theta}_h$ by (E.5).
3: Estimate $\widehat{Q}^{\rho}_h(\cdot, \cdot)$ using (E.2), (E.3) and (E.4) for the TV, KL and $\chi^2$ divergences, respectively.
4: $\pi_h(\cdot) \leftarrow \arg\max_{a \in \mathcal{A}} \widehat{Q}^{\rho}_h(\cdot, a)$, $\widehat{V}^{\rho}_h(\cdot) \leftarrow \max_{a \in \mathcal{A}} \widehat{Q}^{\rho}_h(\cdot, a)$
5: end for
Open Source Code Yes All experiment results can be reproduced by the code in this link: https://anonymous.4open.science/r/Linear-Mixture-DRMDP-8614.
Open Datasets No For the offline dataset collection, we simply use the random policy that chooses actions uniformly at random at any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$ as the behavior policy $\pi^b$ to collect the offline dataset $\mathcal{D}$. The offline dataset $\mathcal{D}$ contains 500 trajectories collected by the behavior policy $\pi^b$ from the source environment $P^0$. For the setting details of the source environment $P^0$, we set the hyperparameters defining the source and target environments as $\xi = (1/\|\xi\|_1, 1/\|\xi\|_1, 1/\|\xi\|_1, 1/\|\xi\|_1)^\top$, $\|\xi\|_1 = 0.4$, $p = 0.1$, $\delta = 0.4$, and $q \in [0, 1]$. We implement Algorithm 2 with TV, KL and $\chi^2$ divergences on the collected offline dataset $\mathcal{D}$.
Dataset Splits No For the offline dataset collection, we simply use the random policy that chooses actions uniformly at random at any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$ as the behavior policy $\pi^b$ to collect the offline dataset $\mathcal{D}$. The offline dataset $\mathcal{D}$ contains 500 trajectories collected by the behavior policy $\pi^b$ from the source environment $P^0$. ... We test the learned policies on various target environments with different levels of perturbation.
Hardware Specification Yes All experiment results are based on 10 replications, and were conducted on a MacBook Pro with a 2.6 GHz 6-Core Intel CPU.
Software Dependencies No The paper does not explicitly state the software dependencies with specific version numbers. It mentions implementing algorithms and using a simulation environment, but no details like Python version, specific library versions (e.g., PyTorch, TensorFlow, scikit-learn), or solver versions are provided.
Experiment Setup Yes For the offline dataset collection, we simply use the random policy that chooses actions uniformly at random at any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$ as the behavior policy $\pi^b$ to collect the offline dataset $\mathcal{D}$. The offline dataset $\mathcal{D}$ contains 500 trajectories collected by the behavior policy $\pi^b$ from the source environment $P^0$. For the setting details of the source environment $P^0$, we set the hyperparameters defining the source and target environments as $\xi = (1/\|\xi\|_1, 1/\|\xi\|_1, 1/\|\xi\|_1, 1/\|\xi\|_1)^\top$, $\|\xi\|_1 = 0.4$, $p = 0.1$, $\delta = 0.4$, and $q \in [0, 1]$. ... Denoting the robust levels of the TV, KL and $\chi^2$ uncertainty sets as $\rho_{\mathrm{TV}}, \rho_{\mathrm{KL}}, \rho_{\chi^2}$, we consider two sets of robust levels: $(\rho_{\mathrm{TV}}, \rho_{\mathrm{KL}}, \rho_{\chi^2}) \in \{(0.35, 5, 10), (0.7, 10, 20)\}$. We compare DRTTR and DRVTR with the nonrobust algorithms, dubbed TTR and VTR respectively, which simply set the robust level $\rho = 0$ in DRTTR and DRVTR.
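
The backward-induction structure quoted for Algorithm 2, and the remark that the nonrobust baselines correspond to $\rho = 0$, can be illustrated with a small tabular sketch. This is not the paper's method. It assumes a known nominal kernel instead of the regression estimates in (4.1) and (E.5), it replaces the estimators (E.2) to (E.4) with an exact worst-case expectation over a total variation ball, and it assumes the convention $D_{\mathrm{TV}}(P, P^0) = \tfrac{1}{2}\|P - P^0\|_1$, which the quoted text does not confirm. The names `worst_case_expectation_tv` and `robust_backward_induction`, the array shapes, and the random toy environment are all hypothetical.

```python
import numpy as np


def worst_case_expectation_tv(p0, values, rho):
    """Minimise sum(P * values) over distributions P with 0.5 * ||P - p0||_1 <= rho.

    Assumed convention for the TV ball; the minimiser moves up to `rho`
    probability mass from the highest-value states to the lowest-value state.
    """
    p = np.asarray(p0, dtype=float).copy()
    values = np.asarray(values, dtype=float)
    worst = int(np.argmin(values))
    budget = float(rho)
    for s in np.argsort(-values):  # visit states from highest to lowest value
        if budget <= 0 or s == worst:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[worst] += moved
        budget -= moved
    return float(p @ values)


def robust_backward_induction(P0, R, rho):
    """Tabular stand-in for the loop in Algorithm 2 (hypothetical helper).

    P0: nominal transitions with shape (H, S, A, S); R: rewards with shape (H, S, A).
    Returns the greedy policy with shape (H, S) and the first-step robust values.
    """
    H, S, A, _ = P0.shape
    V = np.zeros(S)  # initialisation V_{H+1} = 0
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):  # h = H, ..., 1 in the paper's indexing
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[h, s, a] + worst_case_expectation_tv(P0[h, s, a], V, rho)
        policy[h] = Q.argmax(axis=1)  # greedy policy at step h
        V = Q.max(axis=1)             # robust value at step h
    return policy, V


# Toy environment with random numbers (not the paper's simulated environment).
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
P0 = rng.dirichlet(np.ones(S), size=(H, S, A))
R = rng.random((H, S, A))
_, v_nonrobust = robust_backward_induction(P0, R, rho=0.0)  # rho = 0: nonrobust baseline
_, v_robust = robust_backward_induction(P0, R, rho=0.35)    # one quoted TV robust level
print(v_nonrobust, v_robust)
```

With `rho=0.0` the worst-case expectation reduces to the ordinary expectation, so the loop becomes standard finite-horizon value iteration, mirroring how TTR and VTR are described as the $\rho = 0$ cases of DRTTR and DRVTR. The robust values can never exceed the nonrobust ones, since each step minimises over a set that contains the nominal distribution.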