Steady-State Policy Synthesis in Multichain Markov Decision Processes

Authors: George Atia, Andre Beckus, Ismail Alkhouri, Alvaro Velasquez

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first run our proposed LP (5) to calculate the steady-state distribution $\Pr^\pi(s)$ for the Frozen Island example shown in Figure 1. The values for the two recurrent sets (the two islands) are shown in Figure 3. The heat map provides insight into the way in which the agent meets the specifications. Once it enters the islands, the agent spends a large proportion of its time in states s33, s36, s48, s49, s61, and s64, in the sense of the average expected number of visits. The agent satisfies constraints $(L_{\text{log1}}, [0.25, 1])$ and $(L_{\text{log2}}, [0.25, 1])$ largely by visiting states s36 and s61, respectively. Likewise, the agent meets constraints $(L_{\text{canoe1}}, [0.05, 1.0])$, $(L_{\text{canoe2}}, [0.05, 1.0])$, $(L_{\text{fish1}}, [0.1, 1.0])$, and $(L_{\text{fish2}}, [0.1, 1.0])$ by visiting states s33, s49, s48, and s64, respectively. While satisfying the constraints, the agent maximizes the accumulated reward by visiting state s48 over 25% of the time. Figure 4 (left) shows the values of $\Pr^\pi(s)$ along with the optimal values $x^*_s$ obtained from LP (5), demonstrating that the steady-state distribution matches that estimated by the LP for every state. This holds also for all state-action pairs, i.e., $\Pr^\pi(s,a) = x^*_{s,a}$. This condition is critical to the proof of Theorem 2, and ensures that the policy is both optimal and meets the steady-state specifications. For comparison, we solve LP (4.7.6) from [Kallenberg, 1983] to obtain optimal values $x^*, y^*$, and then form the corresponding policy $\pi$. Results are shown in Figure 4 (right). As with Example 1, the produced policy fails to yield a steady-state distribution equal to $x^*$. Table 1 demonstrates the consequences when $\Pr^\pi \neq x^*$. For each specification $(L_i, [l, u]) \in \Phi_L$, Table 1 shows $e^T x^*_{L_i}$ and $\Pr^\pi(L_i) := \sum_{s \in L_i} \Pr^\pi(s)$. For the proposed LP, all of the specifications are met. However, for Kallenberg's formulation, although $x^*_{L_{\text{log1}}}$ and $x^*_{L_{\text{canoe1}}}$ meet the specification, the policy yields steady-state distributions $\Pr^\pi(L_{\text{log1}})$ and $\Pr^\pi(L_{\text{canoe1}})$ which violate the specifications (the violations are highlighted with bold red text). Additionally, we show the optimal reward $R^*$ output by the LP, as well as the average expected reward of the obtained policy $R^\pi := \sum_{s \in S} \sum_{a \in A(s)} \Pr^\pi(s, a) R(s, a)$. Although the $R^*$ obtained by Kallenberg's formulation is larger than that of the proposed LP, the proposed LP produces a policy which gives a larger $R^\pi$. Finally, we execute the policies to verify the validity of the formulations and to further demonstrate the failure of Kallenberg's formulation to meet specifications and yield optimal rewards. Define $S_t$ and $A_t$ as the state and action, respectively, of the Frozen Island example at time $t$, assuming initial distribution $\beta$ and policy $\pi$. The average number of visits $f_{\pi,n}$ and average reward $g_{\pi,n}$ up to time $n$ are defined as $f_{\pi,n}(L) = \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_L(S_t)$, where $\mathbf{1}_L(s) = 1$ if $s \in L$ and $0$ if $s \notin L$ (11), and $g_{\pi,n} = \frac{1}{n} \sum_{t=1}^{n} R(S_t, A_t, S_{t+1})$ (12). For $f_{\pi,n}(L)$ and $g_{\pi,n}$, we take an ensemble average over 5000 paths. In Figure 5 (left), the solid and dotted lines show the average number of visits to the states labeled as logs, the dashed and dash-dotted lines indicate the steady-state distributions, and the square markers show the specification lower bound for the logs. For the proposed LP, $f_{\pi,n}(L_{\text{log1}})$ converges to $\Pr^\pi(L_{\text{log1}})$ and meets the specification. For Kallenberg's policy, although $f_{\pi,n}(L_{\text{log1}})$ converges to $\Pr^\pi(L_{\text{log1}})$, it fails to meet the specifications for the reason described earlier.
In Figure 5 (right), the solid and dotted lines show the average reward, and the dashed and dash-dotted lines indicate the reward output by the LP. While the proposed LP converges to the LP reward, Kallenberg's formulation, as described earlier, converges to a different reward. We also demonstrate the scalability of the proposed LP by running CPLEX 12.8 simulations on a standard desktop computer with 128 GB of RAM for random instances of the Frozen Island problem. See Table 2 for runtime results. (A simulation sketch of the ensemble estimates in Eqs. (11)-(12) appears at the end of this section.)
Researcher Affiliation | Collaboration | George Atia (1), Andre Beckus (1), Ismail Alkhouri (1), and Alvaro Velasquez (2); (1) Department of Electrical and Computer Engineering, University of Central Florida; (2) Information Directorate, Air Force Research Laboratory. Emails: george.atia@ucf.edu, {abeckus,ialkhouri}@knights.ucf.edu, alvaro.velasquez.1@us.af.mil
Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block. It presents mathematical formulations.
Open Source Code | No | The paper does not provide any concrete access (link, explicit statement of release) to the source code for the methodology described.
Open Datasets | Yes | This Frozen Island scenario is similar to those found in OpenAI Gym's Frozen Lake environment [Brockman et al., 2016].
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits. It describes an 'initial uniform distribution' and an 'ensemble average over 5000 paths' for evaluation, but not data partitioning for model training and validation.
Hardware Specification | Yes | We also demonstrate the scalability of the proposed LP by running CPLEX 12.8 simulations on a standard desktop computer with 128 GB of RAM for random instances of the Frozen Island problem.
Software Dependencies | Yes | We also demonstrate the scalability of the proposed LP by running CPLEX 12.8 simulations on a standard desktop computer with 128 GB of RAM for random instances of the Frozen Island problem.
Experiment Setup | Yes | The agent begins on the larger island of size $n \times n/2$ with an initial uniform distribution over those states, i.e., $\beta(s) = 2/n^2$ for every state belonging to the large island. ... For each island, we have $(L_{\text{log1}}, [0.25, 1])$, $(L_{\text{log2}}, [0.25, 1])$, $(L_{\text{canoe1}}, [0.05, 1.0])$, $(L_{\text{canoe2}}, [0.05, 1.0])$, $(L_{\text{fish1}}, [0.1, 1.0])$, $(L_{\text{fish2}}, [0.1, 1.0])$, with $R(\cdot, \cdot, L_{\text{fish1}}) = R(\cdot, \cdot, L_{\text{fish2}}) = 1$ and $R(\cdot, \cdot, S \setminus (L_{\text{fish1}} \cup L_{\text{fish2}})) = 0$. ... Namely, if the agent chooses to go right (left), there is a 90% chance that it will transition to the right (left), and the chance of transitioning to either of the states above or below it is 5%. Similarly, if the agent chooses to go up (down), it will end up in the state above (below) it with 90% chance, and in the states to the right and left of it with chance 5% each. ... For these experiments, we have the constraints $(L_{\text{log1}} \cup L_{\text{log2}}, [0.3, 1])$, $(L_{\text{canoe1}} \cup L_{\text{canoe2}}, [0.05, 1])$ and reward function $R(\cdot, \cdot, L_{\text{fish1}} \cup L_{\text{fish2}}) = 1$, $R(\cdot, \cdot, S \setminus (L_{\text{fish1}} \cup L_{\text{fish2}})) = 0$.
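
To make the quoted slip dynamics concrete, below is a minimal sketch, assuming a single n x n island with the 90%/5%/5% transition model described in the Experiment Setup row. The boundary behavior (the agent stays in place when a move would leave the grid) and the helper name island_transition_kernel are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch (not the authors' code): slippery grid dynamics for one n x n island.
import numpy as np

ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}
PERPENDICULAR = {"left": ("up", "down"), "right": ("up", "down"),
                 "up": ("left", "right"), "down": ("left", "right")}

def island_transition_kernel(n):
    """Return P with P[s, a, s'] = Pr(s' | s, a) for a single n x n island."""
    names = list(ACTIONS)                      # fixed action ordering
    P = np.zeros((n * n, len(names), n * n))
    idx = lambda r, c: r * n + c               # row-major state index
    for r in range(n):
        for c in range(n):
            s = idx(r, c)
            for a, name in enumerate(names):
                # 90% intended direction, 5% for each perpendicular direction
                moves = [(name, 0.90)] + [(p, 0.05) for p in PERPENDICULAR[name]]
                for move, prob in moves:
                    dr, dc = ACTIONS[move]
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < n and 0 <= cc < n):
                        rr, cc = r, c          # assumption: stay in place at the island edge
                    P[s, a, idx(rr, cc)] += prob
    return P

P = island_transition_kernel(8)
assert np.allclose(P.sum(axis=2), 1.0)         # every (state, action) row is a distribution
```

The kernel P above can then feed a steady-state LP or a trajectory simulator; the edge-handling rule is an assumption, since the paper does not state how boundary moves are resolved.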
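
As referenced in the Research Type row, the convergence curves of Figure 5 are ensemble averages of the running estimates in Eqs. (11)-(12). The following is a hedged sketch of how such estimates could be computed; the array-based interface (P, R, pi, beta) and the function name ensemble_estimates are assumptions standing in for whatever simulator the authors used.

```python
# Hedged sketch (not the authors' code): Monte Carlo versions of Eqs. (11)-(12),
# averaged over n_paths independent sample paths, as in the 5000-path ensemble.
import numpy as np

def ensemble_estimates(P, R, pi, beta, label_states, n_steps=2000, n_paths=5000, seed=0):
    """P[s, a, s'] transition kernel, R[s, a, s'] reward, pi[s, a] policy, beta[s] initial
    distribution, label_states = states in the label set L.
    Returns (f_hat, g_hat) with f_hat[t] ~ f_{pi,t+1}(L) and g_hat[t] ~ g_{pi,t+1}."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = pi.shape
    in_L = np.zeros(n_states, dtype=bool)
    in_L[list(label_states)] = True
    f_hat = np.zeros(n_steps)
    g_hat = np.zeros(n_steps)
    for _ in range(n_paths):
        s = rng.choice(n_states, p=beta)                 # S_1 ~ beta
        visits, reward = 0.0, 0.0
        for t in range(n_steps):
            a = rng.choice(n_actions, p=pi[s])           # A_t ~ pi(. | S_t)
            s_next = rng.choice(n_states, p=P[s, a])     # S_{t+1} ~ P(. | S_t, A_t)
            visits += float(in_L[s])                     # 1_L(S_t)
            reward += R[s, a, s_next]                    # R(S_t, A_t, S_{t+1})
            f_hat[t] += visits / (t + 1)                 # running average, Eq. (11)
            g_hat[t] += reward / (t + 1)                 # running average, Eq. (12)
            s = s_next
    return f_hat / n_paths, g_hat / n_paths
```

Plotting f_hat against the LP's steady-state value and the specification lower bound would reproduce the kind of comparison shown in Figure 5 (left), under the stated assumptions.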