A Policy Gradient Method for Confounded POMDPs
Authors: Mao Hong, Zhengling Qi, Yanxun Xu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of Algorithm 1 by conducting a simulation study using an RKHS endowed with a Gaussian kernel. Details of the simulation setup can be found in Appendix L. Figure 2 summarizes the performance of three methods: the proposed method, the naive method, and behavioral cloning. |
| Researcher Affiliation | Academia | Mao Hong (Johns Hopkins University, mhong26@jhu.edu); Zhengling Qi (George Washington University, qizhengling@gwu.edu); Yanxun Xu (Johns Hopkins University, yanxun.xu@jhu.edu) |
| Pseudocode | Yes | Algorithm 1 Policy gradient ascent for POMDP in offline RL |
| Open Source Code | No | The paper does not include any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper states: 'We evaluate the performance of Algorithm 1 by conducting a simulation study'. It then describes how the data are generated: 'Simulated data are generated by assuming S_0 ~ Unif(-2, 2), O_0 ~ 0.8 N(S_0, 0.1) + 0.2 N(-S_0, 0.1), S_1 ~ N(S_0, 0.1), S_2 ~ N(S_1 A_1, 0.01), π_b(+1 | S_t > 0) = π_b(-1 | S_t < 0) = 0.3, O_t ~ 0.8 N(S_t, 0.1) + 0.2 N(-S_t, 0.1), R_t(S_t, A_t) = 2 / (1 + exp(-4 S_t A_t)) - 1, π_{θ_t}(A_t | O_t) ∝ exp(θ_t^⊤ ϕ_t(A_t, O_t)), where ϕ_{t,1}(a_t, o_t) := 2 o_t I(a_t > 0, o_t > 0), ϕ_{t,2}(a_t, o_t) := 2 o_t I(a_t < 0, o_t > 0), ϕ_{t,3}(a_t, o_t) := 2 o_t I(a_t > 0, o_t < 0), ϕ_{t,4}(a_t, o_t) := 2 o_t I(a_t < 0, o_t < 0) for t = 1, 2.' This is a description of data generation, not a publicly available dataset with a link or citation. A hedged code sketch of this data-generating process is given below the table. |
| Dataset Splits | No | The paper describes a simulation study where data is generated for the experiment, but it does not specify any training/validation/test dataset splits or mention cross-validation. The phrase 'offline data' is used to refer to the generated dataset used for learning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments or simulations. |
| Software Dependencies | No | The paper mentions 'RKHS endowed with a Gaussian kernel' in the simulation details, but it does not specify any particular software libraries, frameworks, or their version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn) that would be needed for replication. |
| Experiment Setup | Yes | The paper provides specific experimental setup details such as the total horizon length T = 2, the number of samples N = 10000, and the number of iterations K = 60 for the gradient ascent algorithm. It also mentions step sizes {η_k}_{k=0}^{K-1} and tuning parameters λ_N, µ_N, ξ_N. A generic gradient-ascent skeleton using these hyperparameters is sketched below the table. |
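
For concreteness, the quoted data-generating process can be sketched directly in NumPy. This is a minimal sketch, assuming N(m, v) denotes a Gaussian with mean m and variance v and that actions and rewards occur at t = 1, 2; the function names, seed, and these conventions are illustrative assumptions, not taken from the paper (which released no code).

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the paper

def generate_offline_data(n=10_000):
    """Sketch of the simulation DGP quoted above (horizon T = 2, actions in {-1, +1})."""
    def observe(s):
        # O_t ~ 0.8 N(S_t, 0.1) + 0.2 N(-S_t, 0.1); 0.1 is read as a variance here.
        flip = rng.random(s.shape) < 0.2
        return np.where(flip, rng.normal(-s, np.sqrt(0.1)), rng.normal(s, np.sqrt(0.1)))

    def behavior_action(s):
        # Behavior policy: pi_b(+1 | S_t > 0) = pi_b(-1 | S_t < 0) = 0.3.
        p_plus = np.where(s > 0, 0.3, 0.7)
        return np.where(rng.random(s.shape) < p_plus, 1.0, -1.0)

    def reward(s, a):
        # R_t(S_t, A_t) = 2 / (1 + exp(-4 S_t A_t)) - 1
        return 2.0 / (1.0 + np.exp(-4.0 * s * a)) - 1.0

    s0 = rng.uniform(-2, 2, size=n)
    o0 = observe(s0)
    s1 = rng.normal(s0, np.sqrt(0.1))
    o1, a1 = observe(s1), behavior_action(s1)
    r1 = reward(s1, a1)
    s2 = rng.normal(s1 * a1, np.sqrt(0.01))
    o2, a2 = observe(s2), behavior_action(s2)
    r2 = reward(s2, a2)
    # Only observations, actions, and rewards would be available offline; the states are latent.
    return dict(o0=o0, o1=o1, a1=a1, r1=r1, o2=o2, a2=a2, r2=r2)
```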
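
The reported hyperparameters slot into a standard gradient-ascent loop over the quoted softmax policy class π_{θ_t}(A_t | O_t) ∝ exp(θ_t^⊤ ϕ_t(A_t, O_t)). The sketch below is not the paper's Algorithm 1: the off-policy gradient estimator, which the paper builds using an RKHS with a Gaussian kernel and tuning parameters λ_N, µ_N, ξ_N, is left as a hypothetical placeholder callable, and the step-size values are assumed.

```python
import numpy as np

def features(a, o):
    """Illustrative version of the quoted 4-dimensional feature map phi_t(a_t, o_t)."""
    return np.array([
        2 * o * (a > 0) * (o > 0),
        2 * o * (a < 0) * (o > 0),
        2 * o * (a > 0) * (o < 0),
        2 * o * (a < 0) * (o < 0),
    ], dtype=float)

def policy_prob(theta_t, a, o):
    """Softmax policy pi_{theta_t}(a | o) over the binary action set {-1, +1}."""
    logits = np.array([theta_t @ features(act, o) for act in (-1.0, 1.0)])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[0] if a < 0 else probs[1]

def policy_gradient_ascent(data, grad_estimator, T=2, dim=4, K=60, step_sizes=None):
    """Generic loop matching the reported K = 60 iterations and step sizes eta_k.

    `grad_estimator(theta, data)` stands in for the paper's off-policy
    policy-gradient estimator (Algorithm 1), which is not reproduced here.
    """
    if step_sizes is None:
        step_sizes = [0.1] * K          # eta_k; assumed values, not from the paper
    theta = np.zeros((T, dim))          # one parameter vector per decision time t = 1, 2
    for k in range(K):
        theta = theta + step_sizes[k] * grad_estimator(theta, data)
    return theta
```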