Provable Partially Observable Reinforcement Learning with Privileged Information

Authors: Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now provide numerical results for both of our principled algorithms. We mainly compare against two baselines, the vanilla asymmetric actor-critic [68] and asymmetric Q-learning [7], in two settings: POMDPs under the deterministic filter condition (cf. Definition 3.2) and general POMDPs. We report the results in Table 2 and Figure 2, where our algorithms converge faster to higher rewards.
Researcher Affiliation | Academia | Yale University; University of Maryland, College Park
Pseudocode | Yes | Appendix D, Collection of Algorithms: Algorithm 1 Learning Decoding Function with Privileged Information...; Algorithm 2 Belief-Weighted Optimistic Asymmetric Actor-Critic with Privileged Information...; Algorithm 3 Optimistic Q-function Estimation with Privileged Information...; Algorithm 4 Approximate Belief Learning with Privileged Information via Model Truncation...
Open Source Code | No | The paper states in its NeurIPS checklist that it 'included detailed introductions on the algorithms and environment to reproduce our results', but it does not provide a direct link to a code repository, explicitly state that source code is open-source, or mention code availability in supplementary materials for the methodology developed.
Open Datasets | No | The paper mentions generating POMDPs randomly for its experiments ('we generated 20 POMDPs randomly'), but it does not specify or provide access to any publicly available or open datasets via links, DOIs, repositories, or citations (an illustrative sketch of such random generation is given after this table).
Dataset Splits | No | The paper mentions generating random POMDPs and evaluating them, but it does not specify explicit dataset splits (e.g., percentages, sample counts, or references to predefined splits) for training, validation, or testing.
Hardware Specification | Yes | Finally, all simulations are conducted on a personal laptop with Apple M1 CPU and 16 GB memory.
Software Dependencies | No | The paper mentions 'MDP learning algorithm' and methods like 'Q-value update' and 'ϵ-greedy exploration' but does not list specific software components with version numbers (e.g., Python, PyTorch, TensorFlow, or other libraries).
Experiment Setup | Yes | For the baselines, the hyperparameter α for the Q-value update and the step size for the policy update are tuned by grid search, where α controls the temporal-difference update (recall the TD update rule Q ← (1 − α)Q + αQ_target). For asymmetric Q-learning, we use ϵ-greedy exploration with the classic decreasing rate ϵ_t = (H + 1)/(H + t).
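
To make the quoted experiment setup concrete, below is a minimal sketch of the tabular temporal-difference update Q ← (1 − α)Q + αQ_target and the decreasing exploration rate ϵ_t = (H + 1)/(H + t). The function names, problem sizes, and the Q-learning target used in the example are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def epsilon_schedule(t, H):
    """Decreasing exploration rate eps_t = (H + 1) / (H + t), for t = 1, 2, ..."""
    return (H + 1) / (H + t)

def td_update(Q, s, a, target, alpha):
    """Tabular temporal-difference update: Q <- (1 - alpha) * Q + alpha * Q_target."""
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

def epsilon_greedy_action(Q, s, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

# Example: one (hypothetical) Q-learning step on a 5-state, 3-action problem.
rng = np.random.default_rng(0)
H, t, alpha = 10, 1, 0.5          # alpha would be tuned by grid search
Q = np.zeros((5, 3))
s, s_next, r = 0, 2, 1.0
a = epsilon_greedy_action(Q, s, epsilon_schedule(t, H), rng)
target = r + Q[s_next].max()      # reward plus max next-state value (no discounting in this illustration)
Q = td_update(Q, s, a, target, alpha)
```

In the asymmetric Q-learning baseline the target would additionally exploit the privileged (latent) state during training; this sketch only illustrates the update rule and exploration schedule themselves.

The 20 randomly generated POMDPs referenced in the 'Open Datasets' row are not released. As a point of reference only, here is a minimal sketch of one plausible way to sample random tabular POMDPs, assuming Dirichlet-sampled transition and emission rows and uniform rewards; the sizes, seeds, and the sample_random_pomdp helper are hypothetical and not taken from the paper.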
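```python
import numpy as np

def sample_random_pomdp(n_states=5, n_actions=3, n_obs=4, horizon=10, rng=None):
    """Sample one random tabular POMDP (hypothetical construction, not the paper's).

    Transition and emission rows are drawn from a uniform Dirichlet so each row
    is a valid probability distribution; rewards are uniform in [0, 1].
    """
    rng = np.random.default_rng(rng)
    # T[a, s, s'] = P(s' | s, a)
    T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    # O[s, o] = P(o | s)
    O = rng.dirichlet(np.ones(n_obs), size=n_states)
    # R[s, a] in [0, 1]
    R = rng.uniform(size=(n_states, n_actions))
    # Initial state distribution
    mu0 = rng.dirichlet(np.ones(n_states))
    return dict(T=T, O=O, R=R, mu0=mu0, horizon=horizon)

# e.g., 20 random instances, mirroring the quoted "we generated 20 POMDPs randomly"
pomdps = [sample_random_pomdp(rng=seed) for seed in range(20)]
```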