Steady-State Policy Synthesis in Multichain Markov Decision Processes

Authors: George Atia, Andre Beckus, Ismail Alkhouri, Alvaro Velasquez

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first run our proposed LP (5) to calculate the steady-state distribution $\Pr^\pi(s)$ for the Frozen Island example shown in Figure 1. The values for the two recurrent sets (the two islands) are shown in Figure 3. The heat map provides insight into the way in which the agent meets the specifications. Once it enters the islands, the agent spends a large proportion of its time in states s33, s36, s48, s49, s61, and s64, in the sense of the average expected number of visits. The agent satisfies constraints $(L_{\text{log1}}, [0.25, 1])$ and $(L_{\text{log2}}, [0.25, 1])$ largely by visiting states s36 and s61, respectively. Likewise, the agent meets constraints $(L_{\text{canoe1}}, [0.05, 1.0])$, $(L_{\text{canoe2}}, [0.05, 1.0])$, $(L_{\text{fish1}}, [0.1, 1.0])$, and $(L_{\text{fish2}}, [0.1, 1.0])$ by visiting states s33, s49, s48, and s64, respectively. While satisfying the constraints, the agent maximizes the accumulated reward by visiting state s48 over 25% of the time. Figure 4 (left) shows the values of $\Pr^\pi(s)$ along with the optimal values $x^*_s$ obtained from LP (5), demonstrating that the steady-state distribution matches that estimated by the LP for every state. This holds also for all state-action pairs, i.e., $\Pr^\pi(s,a) = x^*_{s,a}$. This condition is critical to the proof of Theorem 2, and ensures that the policy is both optimal and meets the steady-state specifications. For comparison, we solve LP (4.7.6) from [Kallenberg, 1983] to obtain optimal values $x^*, y^*$, and then form the corresponding policy $\pi$. Results are shown in Figure 4 (right). As with Example 1, the produced policy fails to yield a steady-state distribution equal to $x^*$. Table 1 demonstrates the consequences when $\Pr^\pi \neq x^*$. For each specification $(L_i, [l, u]) \in \Phi_L$, Table 1 shows $e^T x^*_{L_i}$ and $\Pr^\pi(L_i) := \sum_{s \in L_i} \Pr^\pi(s)$. For the proposed LP, all of the specifications are met. However, for Kallenberg's formulation, although $x^*_{L_{\text{log1}}}$ and $x^*_{L_{\text{canoe1}}}$ meet the specification, the policy yields steady-state distributions $\Pr^\pi(L_{\text{log1}})$ and $\Pr^\pi(L_{\text{canoe1}})$ which violate the specifications (the violations are highlighted with bold red text). Additionally, we show the optimal reward $R^*$ output by the LP, as well as the average expected reward of the obtained policy $R^\pi := \sum_{s \in S} \sum_{a \in A(s)} \Pr^\pi(s, a) R(s, a)$. Although the $R^*$ obtained by Kallenberg's formulation is larger than that of the proposed LP, the proposed LP produces a policy which gives a larger $R^\pi$. Finally, we execute the policies to verify the validity of the formulations and to further demonstrate the failure of Kallenberg's formulation to meet specifications and yield optimal rewards. Define $S_t$ and $A_t$ as the state and action, respectively, of the Frozen Island example at time $t$, assuming initial distribution $\beta$ and policy $\pi$. The average number of visits $f_{\pi,n}$ and average reward $g_{\pi,n}$ up to time $n$ are defined as $f_{\pi,n}(L) = \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_L(S_t)$, where $\mathbf{1}_L(s) = 1$ if $s \in L$ and $0$ if $s \notin L$ (11), and $g_{\pi,n} = \frac{1}{n} \sum_{t=1}^{n} R(S_t, A_t, S_{t+1})$ (12). For $f_{\pi,n}(L)$ and $g_{\pi,n}$, we take an ensemble average over 5000 paths. In Figure 5 (left), the solid and dotted lines show the average number of visits to the states labeled as logs, the dashed and dash-dotted lines indicate the steady-state distributions, and the square markers show the specification lower bound for the logs. For the proposed LP, $f_{\pi,n}(L_{\text{log1}})$ converges to $\Pr^\pi(L_{\text{log1}})$ and meets the specification. For Kallenberg's policy, although $f_{\pi,n}(L_{\text{log1}})$ converges to $\Pr^\pi(L_{\text{log1}})$, it fails to meet the specifications for the reason described earlier.
In Figure 5 (right), the solid and dotted lines show the average reward, and the dashed and dash-dotted lines indicate the reward output by the LP. While the proposed LP converges to the LP reward, Kallenberg's formulation, as described earlier, converges to a different reward. We also demonstrate the scalability of the proposed LP by running CPLEX 12.8 simulations on a standard desktop computer with 128 GB of RAM for random instances of the Frozen Island problem. See Table 2 for runtime results. (A simulation sketch of the ensemble estimates in Eqs. (11)-(12) appears at the end of this section.)
Researcher Affiliation | Collaboration | George Atia (1), Andre Beckus (1), Ismail Alkhouri (1), and Alvaro Velasquez (2); (1) Department of Electrical and Computer Engineering, University of Central Florida; (2) Information Directorate, Air Force Research Laboratory. Emails: george.atia@ucf.edu, {abeckus,ialkhouri}@knights.ucf.edu, alvaro.velasquez.1@us.af.mil
Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block. It presents mathematical formulations.
Open Source Code | No | The paper does not provide any concrete access (link, explicit statement of release) to the source code for the methodology described.
Open Datasets | Yes | This Frozen Island scenario is similar to those found in OpenAI Gym's Frozen Lake environment [Brockman et al., 2016].
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits. It describes an 'initial uniform distribution' and an 'ensemble average over 5000 paths' for evaluation, but not data partitioning for model training and validation.
Hardware Specification | Yes | We also demonstrate the scalability of the proposed LP by running CPLEX 12.8 simulations on a standard desktop computer with 128 GB of RAM for random instances of the Frozen Island problem.
Software Dependencies | Yes | We also demonstrate the scalability of the proposed LP by running CPLEX 12.8 simulations on a standard desktop computer with 128 GB of RAM for random instances of the Frozen Island problem.
Experiment Setup | Yes | The agent begins on the larger island of size $n \times n/2$ with an initial uniform distribution over those states, i.e., $\beta(s) = 2/n^2$ for every state belonging to the large island. ... For each island, we have $(L_{\text{log1}}, [0.25, 1])$, $(L_{\text{log2}}, [0.25, 1])$, $(L_{\text{canoe1}}, [0.05, 1.0])$, $(L_{\text{canoe2}}, [0.05, 1.0])$, $(L_{\text{fish1}}, [0.1, 1.0])$, $(L_{\text{fish2}}, [0.1, 1.0])$, with $R(\cdot, \cdot, L_{\text{fish1}}) = R(\cdot, \cdot, L_{\text{fish2}}) = 1$ and $R(\cdot, \cdot, S \setminus (L_{\text{fish1}} \cup L_{\text{fish2}})) = 0$. ... Namely, if the agent chooses to go right (left), there is a 90% chance that it will transition to the right (left), and the chance of transitioning to either of the states above or below it is 5%. Similarly, if the agent chooses to go up (down), it will end up in the state above (below) it with 90% chance, and in the states to the right and left of it with chance 5% each. ... For these experiments, we have the constraints $(L_{\text{log1}} \cup L_{\text{log2}}, [0.3, 1])$, $(L_{\text{canoe1}} \cup L_{\text{canoe2}}, [0.05, 1])$ and reward function $R(\cdot, \cdot, L_{\text{fish1}} \cup L_{\text{fish2}}) = 1$, $R(\cdot, \cdot, S \setminus (L_{\text{fish1}} \cup L_{\text{fish2}})) = 0$.
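
To make the quoted slip dynamics concrete, below is a minimal sketch, assuming a single n x n island with the 90%/5%/5% transition model described in the Experiment Setup row. The boundary behavior (the agent stays in place when a move would leave the grid) and the helper name island_transition_kernel are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch (not the authors' code): slippery grid dynamics for one n x n island.
import numpy as np

ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}
PERPENDICULAR = {"left": ("up", "down"), "right": ("up", "down"),
                 "up": ("left", "right"), "down": ("left", "right")}

def island_transition_kernel(n):
    """Return P with P[s, a, s'] = Pr(s' | s, a) for a single n x n island."""
    names = list(ACTIONS)                      # fixed action ordering
    P = np.zeros((n * n, len(names), n * n))
    idx = lambda r, c: r * n + c               # row-major state index
    for r in range(n):
        for c in range(n):
            s = idx(r, c)
            for a, name in enumerate(names):
                # 90% intended direction, 5% for each perpendicular direction
                moves = [(name, 0.90)] + [(p, 0.05) for p in PERPENDICULAR[name]]
                for move, prob in moves:
                    dr, dc = ACTIONS[move]
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < n and 0 <= cc < n):
                        rr, cc = r, c          # assumption: stay in place at the island edge
                    P[s, a, idx(rr, cc)] += prob
    return P

P = island_transition_kernel(8)
assert np.allclose(P.sum(axis=2), 1.0)         # every (state, action) row is a distribution
```

The kernel P above can then feed a steady-state LP or a trajectory simulator; the edge-handling rule is an assumption, since the paper does not state how boundary moves are resolved.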
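
As referenced in the Research Type row, the convergence curves of Figure 5 are ensemble averages of the running estimates in Eqs. (11)-(12). The following is a hedged sketch of how such estimates could be computed; the array-based interface (P, R, pi, beta) and the function name ensemble_estimates are assumptions standing in for whatever simulator the authors used.

```python
# Hedged sketch (not the authors' code): Monte Carlo versions of Eqs. (11)-(12),
# averaged over n_paths independent sample paths, as in the 5000-path ensemble.
import numpy as np

def ensemble_estimates(P, R, pi, beta, label_states, n_steps=2000, n_paths=5000, seed=0):
    """P[s, a, s'] transition kernel, R[s, a, s'] reward, pi[s, a] policy, beta[s] initial
    distribution, label_states = states in the label set L.
    Returns (f_hat, g_hat) with f_hat[t] ~ f_{pi,t+1}(L) and g_hat[t] ~ g_{pi,t+1}."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = pi.shape
    in_L = np.zeros(n_states, dtype=bool)
    in_L[list(label_states)] = True
    f_hat = np.zeros(n_steps)
    g_hat = np.zeros(n_steps)
    for _ in range(n_paths):
        s = rng.choice(n_states, p=beta)                 # S_1 ~ beta
        visits, reward = 0.0, 0.0
        for t in range(n_steps):
            a = rng.choice(n_actions, p=pi[s])           # A_t ~ pi(. | S_t)
            s_next = rng.choice(n_states, p=P[s, a])     # S_{t+1} ~ P(. | S_t, A_t)
            visits += float(in_L[s])                     # 1_L(S_t)
            reward += R[s, a, s_next]                    # R(S_t, A_t, S_{t+1})
            f_hat[t] += visits / (t + 1)                 # running average, Eq. (11)
            g_hat[t] += reward / (t + 1)                 # running average, Eq. (12)
            s = s_next
    return f_hat / n_paths, g_hat / n_paths
```

Plotting f_hat against the LP's steady-state value and the specification lower bound would reproduce the kind of comparison shown in Figure 5 (left), under the stated assumptions.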