Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy
Authors: Cameron Allen, Aaron Kirtland, Ruo Yu Tao, Sam Lobel, Daniel Scott, Nicholas Petrocelli, Omer Gottesman, Ronald Parr, Michael Littman, George Konidaris
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also demonstrate empirically that, once detected, minimizing the λ-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different λ parameters and minimizes the difference between them as an auxiliary loss. (An illustrative sketch of this auxiliary loss appears below the table.) |
| Researcher Affiliation | Collaboration | Cameron Allen (UC Berkeley); Aaron Kirtland (Brown University); Ruo Yu Tao (Brown University); Sam Lobel (Brown University); Daniel Scott (Georgia Tech); Nicholas Petrocelli (Brown University); Omer Gottesman (Amazon); Ronald Parr (Duke University); Michael L. Littman (Brown University); George Konidaris (Brown University) |
| Pseudocode | Yes | Algorithm 1 describes our memory optimization procedure, which reduces the λ-discrepancy to learn a memory function, then learns an optimal memory-augmented policy. ... In Algorithm 1, the policy improvement function can be any function which improves a parameterized policy. ... In this section we provide pseudocode (Algorithm 2) for the memory-learning algorithm described in Section 4 and used in Algorithm 1. ... In this section, we define the memory-Cartesian product function, expand over memory(), used by Algorithms 1 and 2. This function computes the Cartesian product of the POMDP P and the memory state space M, as described in Appendix G.1. Algorithm 3 Memory Cartesian Product (expand over memory). (A hedged tabular sketch of this Cartesian product appears below the table.) |
| Open Source Code | Yes | Code: https://github.com/brownirl/lambda_discrepancy |
| Open Datasets | Yes | All four of the environments used for evaluation in Section 5 were re-implementations, in JAX [Bradbury et al., 2018], of environments used in Silver and Veness [2010], allowing for massive hardware acceleration. We now give details of these environments. ... Battleship [Silver and Veness, 2010] ... Partially observable Pac-Man [Silver and Veness, 2010] ... Rock Sample (11, 11) and Rock Sample (15, 15) [Smith and Simmons, 2004] |
| Dataset Splits | No | The paper describes hyperparameter tuning by running experiments and selecting the best-performing hyperparameters based on Area Under the Learning Curve (AUC), but it does not explicitly state a 'validation dataset' or a 'validation split' for its reinforcement learning environments. |
| Hardware Specification | Yes | The hyperparameter sweep was performed on a cluster of NVIDIA 3090 GPUs, and the best seeds presented were run on one GPU for each algorithm, running for 1 to 12 hours, depending on the domain. |
| Software Dependencies | No | Our base PPO algorithm is an online PPO algorithm that trains over a vectorized environment, all parallelized using JAX [Bradbury et al., 2018] and the Pure Jax RL batch experimentation library [Lu et al., 2022] with hardware acceleration. |
| Experiment Setup | Yes | We now detail the hyperparameters swept across all environments and all algorithms. We do so in Table 1. Hyperparameter: Step size [2.5×10⁻³, 2.5×10⁻⁴, 2.5×10⁻⁵, 2.5×10⁻⁶]; λ₁ [0.1, 0.5, 0.7, 0.9, 0.95]; λ₂ (λ-discrepancy) [0.1, 0.5, 0.7, 0.9, 0.95]; β (λ-discrepancy) [0, 0.125, 0.25, 0.5]. ... Our L^CLIP clipping ε is set to 0.2. The value loss coefficient is set to c_V = 0.5. We also anneal our learning rate over all training steps, and clip gradients by their global norm when the norm is larger than 0.5 [Pascanu et al., 2013]. (An optax-style sketch of these optimizer settings appears below the table.) |
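The auxiliary loss quoted under *Research Type* can be summarized with a short sketch: two value heads are trained toward TD(λ) targets computed with different λ values, and their disagreement is penalized with weight β. This is an illustrative reconstruction, not the authors' code; the function names, array shapes, and the choice to train each head toward its own λ-return are assumptions, and γ, λ₁, λ₂, β are taken from the sweep quoted above.

```python
# Illustrative sketch of the lambda-discrepancy auxiliary loss: two value heads,
# each regressed toward TD(lambda) targets with its own lambda, plus a
# beta-weighted penalty on their squared difference. Names/shapes are assumptions.
import jax
import jax.numpy as jnp


def lambda_returns(rewards, next_values, dones, gamma, lam):
    """Backward recursion: G_t = r_t + gamma*(1-done_t)*((1-lam)*V(s_{t+1}) + lam*G_{t+1})."""
    def step(g_next, inputs):
        r, v_next, done = inputs
        g = r + gamma * (1.0 - done) * ((1.0 - lam) * v_next + lam * g_next)
        return g, g

    # Scan backward over the trajectory; bootstrap from the final next-state value.
    _, returns = jax.lax.scan(step, next_values[-1], (rewards, next_values, dones), reverse=True)
    return returns


def lambda_discrepancy_loss(v1, v2, next_v1, next_v2, rewards, dones,
                            gamma=0.99, lam1=0.95, lam2=0.5, beta=0.25):
    """Two value losses plus beta times the squared lambda-discrepancy."""
    g1 = jax.lax.stop_gradient(lambda_returns(rewards, next_v1, dones, gamma, lam1))
    g2 = jax.lax.stop_gradient(lambda_returns(rewards, next_v2, dones, gamma, lam2))
    value_loss = jnp.mean((v1 - g1) ** 2) + jnp.mean((v2 - g2) ** 2)
    discrepancy = jnp.mean((v1 - v2) ** 2)  # auxiliary term driven toward zero
    return value_loss + beta * discrepancy
```

In the paper's agent the two heads share a recurrent encoder; here `v1` and `v2` simply stand for each head's per-timestep predictions over a rollout of length T.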
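The memory Cartesian product referenced for Algorithm 3 can be sketched in tabular form. This is a hedged reconstruction under assumed conventions (tensor layouts, a reward table R[a, s], and a memory update μ(m' | a, o, m) that conditions on the current observation and action); it is not the authors' expand over memory() routine.

```python
# Hedged sketch of a tabular "memory Cartesian product": augment a POMDP's state
# and observation spaces with a finite memory variable updated by mu(m' | a, o, m).
# Tensor layouts and the update convention are assumptions, not the paper's definition.
import numpy as np


def expand_over_memory(T, R, phi, mu):
    """
    T:   (A, S, S)      state transition probabilities
    R:   (A, S)         expected rewards
    phi: (S, O)         observation probabilities
    mu:  (A, O, M, M)   memory update Pr(m' | a, o, m)
    Returns transition, reward, and observation tensors over S*M and O*M.
    """
    A, S, _ = T.shape
    M = mu.shape[-1]

    # Expected memory update from each state: marginalize the observation.
    # mem_update[a, s, m, m'] = sum_o phi[s, o] * mu[a, o, m, m']
    mem_update = np.einsum("so,aomn->asmn", phi, mu)

    # Expanded transitions (s, m) -> (s', m'), memory updated from the current (s, m).
    T_x = np.einsum("ast,asmn->asmtn", T, mem_update).reshape(A, S * M, S * M)

    # Rewards depend only on the underlying state; observations gain the memory bit.
    R_x = np.repeat(R, M, axis=1)            # (A, S*M)
    phi_x = np.kron(phi, np.eye(M))          # (S*M, O*M)
    return T_x, R_x, phi_x
```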
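The optimizer details quoted under *Experiment Setup* (learning-rate annealing over all training steps, gradient clipping by global norm at 0.5, clip ε = 0.2, value coefficient c_V = 0.5) map onto a short optax sketch. The linear schedule shape, the total update count, and the use of Adam are assumptions; the loss below omits the entropy bonus and the λ-discrepancy auxiliary term.

```python
# Hedged sketch of the quoted optimizer settings and PPO clipped objective.
# NUM_UPDATES and the choice of Adam are assumptions for illustration.
import jax.numpy as jnp
import optax

NUM_UPDATES = 10_000                      # assumed total number of gradient updates
INIT_LR = 2.5e-4                          # one of the swept step sizes

# Anneal the learning rate linearly to zero over all training steps.
lr_schedule = optax.linear_schedule(init_value=INIT_LR, end_value=0.0,
                                    transition_steps=NUM_UPDATES)

# Clip gradients by global norm (0.5), then apply Adam with the annealed rate.
optimizer = optax.chain(
    optax.clip_by_global_norm(0.5),
    optax.adam(learning_rate=lr_schedule),
)


def ppo_loss(log_ratio, advantages, values, value_targets, clip_eps=0.2, c_v=0.5):
    """Clipped surrogate objective plus c_V-weighted value loss (entropy term omitted)."""
    ratio = jnp.exp(log_ratio)
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -jnp.mean(jnp.minimum(ratio * advantages, clipped * advantages))
    value_loss = c_v * jnp.mean((values - value_targets) ** 2)
    return policy_loss + value_loss
```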