Learning to Explore in POMDPs with Informational Rewards
Authors: Annie Xie, Logan Mondal Bhamidipaty, Evan Zheran Liu, Joey Hong, Sergey Levine, Chelsea Finn
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments aim to study whether PROBE can learn effective exploration strategies across various POMDP problems with privileged state information at training time. ... Through experiments in several partially-observed environments, we find that our approach is competitive with prior methods when minimal exploration is needed, but substantially outperforms them when more complex strategies are required. |
| Researcher Affiliation | Collaboration | Annie Xie¹, Logan Mondal Bhamidipaty¹, Evan Zheran Liu², Joey Hong³, Sergey Levine³, Chelsea Finn¹ (¹Stanford University, ²Imbue, ³UC Berkeley). |
| Pseudocode | Yes | Algorithm 1 PROBE (single train episode) |
| Open Source Code | Yes | Videos and code can be found at https://sites.google.com/view/probe-explore-icml. |
| Open Datasets | Yes | Tiger Door (Littman et al., 1995). ... Light-Dark (Platt Jr et al., 2010). ... Map (Liu et al., 2021). |
| Dataset Splits | No | The paper does not specify exact dataset split percentages or sample counts for training, validation, or testing, nor does it explicitly refer to standard predefined splits with sufficient detail for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as CPU models, GPU models (e.g., NVIDIA A100, RTX 3090), or memory specifications. |
| Software Dependencies | No | The paper mentions using specific algorithms and network architectures (e.g., recurrent deep dueling double-Q network, LSTM, CNN, PPO) but does not specify the version numbers of any software libraries or frameworks (e.g., PyTorch 1.x, TensorFlow 2.x, scikit-learn x.x). |
| Experiment Setup | Yes | For all of our experiments, we choose K = 10 following DREAM. We minimize the sum of these four losses, and periodically update the target network. ... $r^{\text{PROBE,clipped}}_t = \min\left(\lVert f_\psi(i_{t+1}) - g_\omega(h_t)\rVert_2^2,\, D\right) - \min\left(\lVert f_\psi(i_{t+1}) - g_\omega(h_{t+1})\rVert_2^2,\, D\right)$, where we choose D = 1.0 for all of our experiments (a code sketch of this reward appears below the table). |
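
The clipped reward quoted in the Experiment Setup row can be read as the clipped reduction in error when predicting the privileged information $i_{t+1}$ from the history before and after the new observation. The following is a minimal PyTorch-style sketch of that computation, not the authors' released code: the function name `probe_clipped_reward` and the tensor shapes are assumptions, while the formula and the default D = 1.0 come from the paper.

```python
import torch

def probe_clipped_reward(f_i_next: torch.Tensor,
                         g_h_t: torch.Tensor,
                         g_h_next: torch.Tensor,
                         clip_d: float = 1.0) -> torch.Tensor:
    """Sketch of the clipped PROBE informational reward.

    f_i_next : embedding f_psi(i_{t+1}) of the privileged information.
    g_h_t    : prediction g_omega(h_t) from the history before the new observation.
    g_h_next : prediction g_omega(h_{t+1}) from the history after the new observation.
    clip_d   : clipping threshold D (the paper reports D = 1.0).
    """
    # Squared L2 prediction error before and after the new observation, each clipped at D.
    err_before = torch.clamp(((f_i_next - g_h_t) ** 2).sum(dim=-1), max=clip_d)
    err_after = torch.clamp(((f_i_next - g_h_next) ** 2).sum(dim=-1), max=clip_d)
    # Reward the reduction in (clipped) error for predicting the privileged info.
    return err_before - err_after
```

Under these assumptions, the reward is positive when the latest observation makes the agent's prediction of the privileged state more accurate, and the clip at D keeps a single highly uncertain step from dominating the exploration signal.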