Learning to Explore in POMDPs with Informational Rewards

Authors: Annie Xie, Logan Mondal Bhamidipaty, Evan Zheran Liu, Joey Hong, Sergey Levine, Chelsea Finn

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments aim to study whether PROBE can learn effective exploration strategies across various POMDP problems with privileged state information at training time. ... Through experiments in several partially-observed environments, we find that our approach is competitive with prior methods when minimal exploration is needed, but substantially outperforms them when more complex strategies are required.
Researcher Affiliation | Collaboration | Annie Xie (Stanford University), Logan Mondal Bhamidipaty (Stanford University), Evan Zheran Liu (Imbue), Joey Hong (UC Berkeley), Sergey Levine (UC Berkeley), Chelsea Finn (Stanford University).
Pseudocode | Yes | Algorithm 1 PROBE (single train episode)
Open Source Code | Yes | Videos and code can be found at https://sites.google.com/view/probe-explore-icml.
Open Datasets | Yes | Tiger Door (Littman et al., 1995). ... Light-Dark (Platt Jr et al., 2010). ... Map (Liu et al., 2021).
Dataset Splits | No | The paper does not specify exact dataset split percentages or sample counts for training, validation, or testing, nor does it explicitly refer to standard predefined splits with sufficient detail for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as CPU models, GPU models (e.g., NVIDIA A100, RTX 3090), or memory specifications.
Software Dependencies | No | The paper mentions using specific algorithms and network architectures (e.g., recurrent deep dueling double-Q network, LSTM, CNN, PPO) but does not specify the version numbers of any software libraries or frameworks (e.g., PyTorch 1.x, TensorFlow 2.x, scikit-learn x.x).
Experiment Setup | Yes | For all of our experiments, we choose K = 10 following DREAM. We minimize the sum of these four losses, and periodically update the target network. ... $r_t^{\text{PROBE,clipped}} = \min\left(\|f_\psi(i_{t+1}) - g_\omega(h_t)\|_2^2,\, D\right) - \min\left(\|f_\psi(i_{t+1}) - g_\omega(h_{t+1})\|_2^2,\, D\right)$, where we choose D = 1.0 for all of our experiments.
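
The clipped informational reward quoted in the Experiment Setup row can be illustrated with a minimal PyTorch-style sketch. This is not the authors' code: the function and tensor names (`probe_clipped_reward`, `f_psi_i_next`, `g_omega_h`, `g_omega_h_next`) are hypothetical, and reading the reward as the difference of the two clipped prediction-error terms is our interpretation of the formula above.

```python
import torch

def probe_clipped_reward(f_psi_i_next: torch.Tensor,
                         g_omega_h: torch.Tensor,
                         g_omega_h_next: torch.Tensor,
                         D: float = 1.0) -> torch.Tensor:
    """Sketch of the clipped informational reward (hypothetical names).

    f_psi_i_next:   embedding of the privileged information i_{t+1} under f_psi
    g_omega_h:      prediction from the history h_t under g_omega
    g_omega_h_next: prediction from the updated history h_{t+1} under g_omega
    """
    # Squared prediction error before and after incorporating the new observation.
    err_before = ((f_psi_i_next - g_omega_h) ** 2).sum(dim=-1)
    err_after = ((f_psi_i_next - g_omega_h_next) ** 2).sum(dim=-1)
    # Clip each term at D (the paper uses D = 1.0) and reward the reduction
    # in prediction error of the privileged information.
    return torch.clamp(err_before, max=D) - torch.clamp(err_after, max=D)
```

Clipping both terms at D = 1.0 bounds the magnitude of each error term, so a single transition cannot produce an arbitrarily large informational reward.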