Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

Authors: Cameron Allen, Aaron Kirtland, Ruo Yu Tao, Sam Lobel, Daniel Scott, Nicholas Petrocelli, Omer Gottesman, Ronald Parr, Michael Littman, George Konidaris

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also demonstrate empirically that, once detected, minimizing the λ-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different λ parameters and minimizes the difference between them as an auxiliary loss.
Researcher Affiliation | Collaboration | Cameron Allen (UC Berkeley); Aaron Kirtland (Brown University); Ruo Yu Tao (Brown University); Sam Lobel (Brown University); Daniel Scott (Georgia Tech); Nicholas Petrocelli (Brown University); Omer Gottesman (Amazon); Ronald Parr (Duke University); Michael L. Littman (Brown University); George Konidaris (Brown University)
Pseudocode | Yes | Algorithm 1 describes our memory optimization procedure, which reduces the λ-discrepancy to learn a memory function, then learns an optimal memory-augmented policy. ... In Algorithm 1, the policy improvement function can be any function which improves a parameterized policy. ... In this section we provide pseudocode (Algorithm 2) for the memory-learning algorithm described in Section 4 and used in Algorithm 1. ... In this section, we define the memory-Cartesian product function, expand over memory(), used by Algorithms 1 and 2. This function computes the Cartesian product of the POMDP P and the memory state space M, as described in Appendix G.1. Algorithm 3: Memory Cartesian Product (expand over memory).
Open Source Code | Yes | Code: https://github.com/brownirl/lambda_discrepancy
Open Datasets | Yes | All four environments used for evaluation in Section 5 are re-implementations of environments used in Silver and Veness [2010], written in JAX [Bradbury et al., 2018] to allow for massive hardware acceleration. We now give details of these environments. ... Battleship [Silver and Veness, 2010] ... Partially observable Pac Man [Silver and Veness, 2010] ... Rock Sample (11, 11) and Rock Sample (15, 15) [Smith and Simmons, 2004]
Dataset Splits | No | The paper describes hyperparameter tuning by running experiments and selecting the best-performing hyperparameters based on area under the learning curve (AUC), but it does not explicitly state a validation dataset or validation split for its reinforcement learning environments.
Hardware Specification | Yes | The hyperparameter sweep was performed on a cluster of NVIDIA 3090 GPUs, and the best seeds presented were run on one GPU for each algorithm, running for 1 to 12 hours, depending on the domain.
Software Dependencies | No | Our base PPO algorithm is an online PPO algorithm that trains over a vectorized environment, all parallelized using JAX [Bradbury et al., 2018] and the PureJaxRL batch experimentation library [Lu et al., 2022] with hardware acceleration.
Experiment Setup | Yes | We now detail the hyperparameters swept across all environments and all algorithms. We do so in Table 1. Hyperparameters: step size [2.5×10⁻³, 2.5×10⁻⁴, 2.5×10⁻⁵, 2.5×10⁻⁶]; λ1 [0.1, 0.5, 0.7, 0.9, 0.95]; λ2 (λ-discrepancy) [0.1, 0.5, 0.7, 0.9, 0.95]; β (λ-discrepancy) [0, 0.125, 0.25, 0.5]. ... Our L^CLIP clipping ε is set to 0.2. The value loss coefficient is set to c_V = 0.5. We also anneal our learning rate over all training steps, and clip gradients when the norm is larger than 0.5 by their global norm [Pascanu et al., 2013].
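
The excerpts above lend themselves to a few illustrative sketches. First, for the "Research Type" row: a minimal JAX sketch of the auxiliary objective it describes, in which two value heads are each regressed onto their own TD(λ) target and the squared gap between their predictions is added as a β-weighted penalty. This is not the authors' implementation; the target construction, the shapes, and the exact form of the discrepancy term are assumptions.

```python
# Minimal sketch of a lambda-discrepancy auxiliary loss (not the authors'
# code). Assumes each value head is regressed onto its own TD(lambda) target
# and the discrepancy is the mean squared gap between the two heads.
import jax
import jax.numpy as jnp

def lambda_returns(rewards, values, dones, lam, gamma=0.99):
    """Backward-recursive TD(lambda) returns for one trajectory.
    Shapes: rewards (T,), dones (T,), values (T+1,) with a bootstrap entry."""
    def step(next_return, inputs):
        reward, next_value, done = inputs
        td = reward + gamma * (1.0 - done) * next_value
        ret = td + gamma * lam * (1.0 - done) * (next_return - next_value)
        return ret, ret
    _, returns = jax.lax.scan(
        step, values[-1], (rewards, values[1:], dones), reverse=True)
    return returns

def value_and_discrepancy_loss(rewards, v1, v2, dones, lam1, lam2, beta):
    """Each head is trained toward its own lambda-return; the beta-weighted
    mean squared gap between the heads acts as the discrepancy penalty."""
    g1 = jax.lax.stop_gradient(lambda_returns(rewards, v1, dones, lam1))
    g2 = jax.lax.stop_gradient(lambda_returns(rewards, v2, dones, lam2))
    value_loss = jnp.mean((v1[:-1] - g1) ** 2) + jnp.mean((v2[:-1] - g2) ** 2)
    discrepancy = jnp.mean((v1 - v2) ** 2)
    return value_loss + beta * discrepancy
```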
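
Second, for the "Pseudocode" row: a skeleton of the two-phase structure it describes (minimize the λ-discrepancy to learn a memory function, then learn a memory-augmented policy). The helpers expand_over_memory, lambda_discrepancy, and policy_improvement below are toy stand-ins for the paper's Algorithms 2 and 3 and for a policy-improvement routine, not the repository's API; the optimizer choice is likewise an assumption.

```python
# Two-phase skeleton mirroring the Algorithm 1 description quoted above.
# All helper bodies are toy stand-ins so the skeleton runs end to end;
# they do not reproduce the paper's actual algorithms.
import jax
import jax.numpy as jnp
import optax

def expand_over_memory(pomdp, mem_params):
    """Toy stand-in for the memory Cartesian product (Algorithm 3)."""
    return {"pomdp": pomdp, "memory": mem_params}

def lambda_discrepancy(augmented, lam1, lam2):
    """Toy differentiable surrogate for the lambda-discrepancy (Algorithm 2)."""
    return (lam2 - lam1) * jnp.sum(augmented["memory"] ** 2)

def policy_improvement(augmented, policy_params):
    """Toy stand-in for any policy-improvement routine (e.g. PPO)."""
    return policy_params

def optimize_memory_then_policy(pomdp, mem_params, policy_params,
                                lam1=0.1, lam2=0.95, steps=100, lr=1e-2):
    # Phase 1: gradient descent on the lambda-discrepancy of the
    # memory-augmented POMDP with respect to the memory parameters.
    opt = optax.adam(lr)
    opt_state = opt.init(mem_params)
    loss = lambda p: lambda_discrepancy(expand_over_memory(pomdp, p), lam1, lam2)
    for _ in range(steps):
        grads = jax.grad(loss)(mem_params)
        updates, opt_state = opt.update(grads, opt_state)
        mem_params = optax.apply_updates(mem_params, updates)
    # Phase 2: with the memory function fixed, learn the augmented policy.
    policy_params = policy_improvement(expand_over_memory(pomdp, mem_params),
                                       policy_params)
    return mem_params, policy_params
```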
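
Third, for the "Software Dependencies" row: the general jit + vmap pattern behind vectorized, hardware-accelerated environment rollouts in JAX. The toy env_reset / env_step functions are invented for illustration and are unrelated to the paper's Battleship, Pac Man, and RockSample implementations or to PureJaxRL's API.

```python
# General jit + vmap pattern for vectorized environment rollouts in JAX.
# env_reset / env_step are toy functions invented for illustration only.
import jax
import jax.numpy as jnp

def env_reset(key):
    """Toy reset: a random 4-dimensional observation."""
    return jax.random.uniform(key, (4,))

def env_step(state, action):
    """Toy dynamics: drift by the action, reward closeness to the origin."""
    next_state = state + action
    reward = -jnp.sum(jnp.abs(next_state))
    return next_state, reward

# vmap runs many environment copies in lockstep; jit compiles the whole
# batch, so rollouts stay on the accelerator end to end.
batched_reset = jax.jit(jax.vmap(env_reset))
batched_step = jax.jit(jax.vmap(env_step))

keys = jax.random.split(jax.random.PRNGKey(0), 8)   # 8 parallel environments
states = batched_reset(keys)                        # shape (8, 4)
states, rewards = batched_step(states, jnp.ones((8, 4)))
```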
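
Finally, for the "Experiment Setup" row: one way to express the quoted optimizer settings (gradient clipping at a global norm of 0.5, learning rate annealed over all training steps) with optax. The linear schedule shape and the choice of Adam are assumptions rather than details taken from the paper's code.

```python
# Hedged sketch of the quoted optimizer settings: clip gradients at a global
# norm of 0.5 and anneal the step size over all training updates. The linear
# schedule and Adam are assumptions, not the paper's exact configuration.
import optax

def make_optimizer(init_lr=2.5e-4, total_updates=100_000):
    schedule = optax.linear_schedule(
        init_value=init_lr, end_value=0.0, transition_steps=total_updates)
    return optax.chain(
        optax.clip_by_global_norm(0.5),      # gradient clipping by global norm
        optax.adam(learning_rate=schedule),  # step size annealed over training
    )
```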