Learning to Explore in POMDPs with Informational Rewards
Authors: Annie Xie, Logan Mondal Bhamidipaty, Evan Zheran Liu, Joey Hong, Sergey Levine, Chelsea Finn
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments aim to study whether PROBE can learn effective exploration strategies across various POMDP problems with privileged state information at training time. ... Through experiments in several partially-observed environments, we find that our approach is competitive with prior methods when minimal exploration is needed, but substantially outperforms them when more complex strategies are required. |
| Researcher Affiliation | Collaboration | Annie Xie¹, Logan Mondal Bhamidipaty¹, Evan Zheran Liu², Joey Hong³, Sergey Levine³, Chelsea Finn¹ (¹Stanford University, ²Imbue, ³UC Berkeley). |
| Pseudocode | Yes | Algorithm 1 PROBE (single train episode) |
| Open Source Code | Yes | Videos and code can be found at https://sites.google.com/view/probe-explore-icml. |
| Open Datasets | Yes | Tiger Door (Littman et al., 1995). ... Light-Dark (Platt Jr et al., 2010). ... Map (Liu et al., 2021). |
| Dataset Splits | No | The paper does not specify exact dataset split percentages or sample counts for training, validation, or testing, nor does it explicitly refer to standard predefined splits with sufficient detail for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as CPU models, GPU models (e.g., NVIDIA A100, RTX 3090), or memory specifications. |
| Software Dependencies | No | The paper mentions using specific algorithms and network architectures (e.g., recurrent deep dueling double-Q network, LSTM, CNN, PPO) but does not specify the version numbers of any software libraries or frameworks (e.g., PyTorch 1.x, TensorFlow 2.x, scikit-learn x.x). |
| Experiment Setup | Yes | For all of our experiments, we choose K = 10 following DREAM. We minimize the sum of these four losses, and periodically update the target network. ... $r^{\text{PROBE,clipped}}_t = \min\left(\lVert f_\psi(i_{t+1}) - g_\omega(h_t)\rVert_2^2,\, D\right) - \min\left(\lVert f_\psi(i_{t+1}) - g_\omega(h_{t+1})\rVert_2^2,\, D\right)$, where we choose D = 1.0 for all of our experiments (a code sketch of this reward appears below the table). |
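
The clipped reward quoted in the Experiment Setup row can be read as the clipped reduction in error when predicting the privileged information $i_{t+1}$ from the history before and after the new observation. The following is a minimal PyTorch-style sketch of that computation, not the authors' released code: the function name `probe_clipped_reward` and the tensor shapes are assumptions, while the formula and the default D = 1.0 come from the paper.

```python
import torch

def probe_clipped_reward(f_i_next: torch.Tensor,
                         g_h_t: torch.Tensor,
                         g_h_next: torch.Tensor,
                         clip_d: float = 1.0) -> torch.Tensor:
    """Sketch of the clipped PROBE informational reward.

    f_i_next : embedding f_psi(i_{t+1}) of the privileged information.
    g_h_t    : prediction g_omega(h_t) from the history before the new observation.
    g_h_next : prediction g_omega(h_{t+1}) from the history after the new observation.
    clip_d   : clipping threshold D (the paper reports D = 1.0).
    """
    # Squared L2 prediction error before and after the new observation, each clipped at D.
    err_before = torch.clamp(((f_i_next - g_h_t) ** 2).sum(dim=-1), max=clip_d)
    err_after = torch.clamp(((f_i_next - g_h_next) ** 2).sum(dim=-1), max=clip_d)
    # Reward the reduction in (clipped) error for predicting the privileged info.
    return err_before - err_after
```

Under these assumptions, the reward is positive when the latest observation makes the agent's prediction of the privileged state more accurate, and the clip at D keeps a single highly uncertain step from dominating the exploration signal.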