Provable Partially Observable Reinforcement Learning with Privileged Information
Authors: Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now provide some numerical results for both of our principled algorithms. Here we mainly compare with two baselines, the vanilla asymmetric actor-critic [68] and asymmetric Q-learning [7], in two settings: POMDPs under the deterministic filter condition (cf. Definition 3.2) and general POMDPs. We report the results in Table 2 and Figure 2, where our algorithms converge faster to higher rewards. |
| Researcher Affiliation | Academia | Yale University; University of Maryland, College Park |
| Pseudocode | Yes | Appendix D, Collection of Algorithms: Algorithm 1 (Learning Decoding Function with Privileged Information); Algorithm 2 (Belief-Weighted Optimistic Asymmetric Actor-Critic with Privileged Information); Algorithm 3 (Optimistic Q-function Estimation with Privileged Information); Algorithm 4 (Approximate Belief Learning with Privileged Information via Model Truncation). |
| Open Source Code | No | The paper states in its NeurIPS checklist that it 'included detailed introductions on the algorithms and environment to reproduce our results', but it does not provide a direct link to a code repository, explicitly state that the source code is open-source, or mention code availability for the developed methodology in the supplementary materials. |
| Open Datasets | No | The paper mentions generating 'POMDPs randomly' for experiments ('we generated 20 POMDPs randomly'). It does not specify or provide access to any publicly available or open datasets with links, DOIs, repositories, or citations. |
| Dataset Splits | No | The paper mentions generating random POMDPs and evaluating them, but it does not specify explicit dataset splits (e.g., percentages, sample counts, or references to predefined splits) for training, validation, or testing. |
| Hardware Specification | Yes | Finally, all simulations are conducted on a personal laptop with Apple M1 CPU and 16 GB memory. |
| Software Dependencies | No | The paper mentions 'MDP learning algorithm' and methods like 'Q-value update' and 'ϵ-greedy exploration' but does not list specific software components with version numbers (e.g., Python, PyTorch, TensorFlow, or other libraries). |
| Experiment Setup | Yes | For baselines, the hyperparameter α for the Q-value update and the step size for the policy update are tuned by grid search, where α controls the temporal-difference update (recall the update rule Q ← (1 − α)Q + α·Q_target). For asymmetric Q-learning, we use ϵ-greedy exploration with the decreasing rate ϵ_t = (H + 1)/(H + t). |
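
The quoted setup pins down the baseline's TD update and exploration schedule. Below is a minimal, illustrative tabular sketch of that update combined with the decreasing ϵ-greedy rate ϵ_t = (H + 1)/(H + t); it is not the authors' implementation, and the environment interface (`reset`/`step`), the discount factor `gamma`, and the default `alpha` are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy_q_learning(env, num_states, num_actions, H,
                              num_episodes, alpha=0.1, gamma=1.0, seed=0):
    """Sketch of tabular Q-learning with the convex-combination TD update
    Q <- (1 - alpha) * Q + alpha * Q_target and the decreasing exploration
    rate eps_t = (H + 1) / (H + t) quoted in the experiment setup.
    The env interface (reset() -> state, step(a) -> (state, reward, done))
    is an assumption for this example, not taken from the paper."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for t in range(1, num_episodes + 1):
        eps = (H + 1) / (H + t)          # decreasing exploration rate
        s = env.reset()
        for _ in range(H):               # finite horizon of length H
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD target, then Q <- (1 - alpha) * Q + alpha * Q_target
            q_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * q_target
            s = s_next
            if done:
                break
    return Q
```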