No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
Authors: Skander Moalla, Andrea Miele, Daniil Pyatko, Razvan Pascanu, Caglar Gulcehre
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we empirically study representation dynamics in Proximal Policy Optimization (PPO) on the Atari and MuJoCo environments, revealing that PPO agents are also affected by feature rank deterioration and capacity loss. |
| Researcher Affiliation | Collaboration | Skander Moalla¹, Andrea Miele¹, Daniil Pyatko¹, Razvan Pascanu², Caglar Gulcehre¹ (¹ CLAIRE, EPFL; ² Google DeepMind) |
| Pseudocode | Yes | We refer to PPO-Clip as PPO and provide a pseudocode in Algorithm 1. |
| Open Source Code | Yes | Code and run histories are available at https://github.com/CLAIRE-Labo/no-representation-no-trust. |
| Open Datasets | Yes | We begin our experiments by training PPO agents on the Arcade Learning Environment (ALE) (Bellemare et al., 2013) for pixel-based observations with discrete actions and on MuJoCo (Todorov et al., 2012) for continuous observations with continuous actions. ... To keep our experiments tractable, we choose the Atari-5 subset recommended by Aitchison et al. (2023) |
| Dataset Splits | No | The paper describes how data is collected in rollouts during training (e.g., 'Collect a batch of interaction steps of size B = N·B_env and compute advantages') and used for training, but it does not specify static training, validation, or test splits as percentages or fixed sample counts, since the data is generated dynamically. |
| Hardware Specification | Yes | The experiments in this project took a total of ~11,300 GPU hours on NVIDIA V100 and A100 GPUs (ALE) and ~25,500 CPU hours (MuJoCo). A run on ALE takes around 10 hours on an A100 and 16 hours on a V100. A run on MuJoCo takes around 5 hours on 6 CPUs. |
| Software Dependencies | No | Our codebase uses TorchRL (Bou et al., 2024) and provides a comprehensive toolbox to study representation dynamics in policy optimization. We also provide modified scripts of CleanRL (Huang et al., 2022b) to replicate the collapse observed in this work and ensure it is not a bug from our novel codebase. |
| Experiment Setup | Yes | We provide a high-level pseudocode for PPO in Algorithm 1 and list all hyperparameters considered in Tables 2 and 3. |
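
The Research Type row summarizes the paper's central observation: PPO agents suffer feature rank deterioration and capacity loss. A common way to quantify such deterioration is the effective rank (srank) of a batch of penultimate-layer features. The sketch below assumes that metric and a plain (n_samples, n_features) activation matrix; it is an illustrative implementation, not code from the paper's repository.

```python
import torch

def effective_rank(features: torch.Tensor, delta: float = 0.01) -> int:
    """Effective rank (srank_delta) of a feature matrix: the smallest k such that
    the top-k singular values account for a (1 - delta) fraction of the total
    singular-value mass. Lower values indicate a more collapsed representation.
    """
    singular_values = torch.linalg.svdvals(features)  # sorted in descending order
    cumulative = torch.cumsum(singular_values, dim=0)
    threshold = (1.0 - delta) * singular_values.sum()
    # Count how many singular values are needed before the cumulative mass
    # reaches the threshold; ranks are 1-indexed, hence the +1.
    return int((cumulative < threshold).sum().item()) + 1

# Toy usage: an almost rank-8 feature matrix should report a small effective rank.
features = torch.randn(512, 8) @ torch.randn(8, 256) + 1e-4 * torch.randn(512, 256)
print(effective_rank(features))
```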
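The Pseudocode and Experiment Setup rows point to the paper's Algorithm 1 for PPO-Clip. As a quick reference, here is a minimal sketch of the clipped surrogate objective that PPO-Clip optimizes; the function name and argument layout are illustrative assumptions, not taken from the paper's codebase.

```python
import torch

def ppo_clip_loss(
    new_logprobs: torch.Tensor,  # log pi_theta(a_t | s_t) under the current policy
    old_logprobs: torch.Tensor,  # log pi_theta_old(a_t | s_t) recorded at collection time
    advantages: torch.Tensor,    # advantage estimates for the collected transitions
    clip_eps: float = 0.2,       # PPO clipping coefficient epsilon
) -> torch.Tensor:
    """Clipped surrogate policy loss of PPO-Clip, returned as a quantity to minimize."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum of the two terms; negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```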
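The Dataset Splits row quotes the rollout step of Algorithm 1: a batch of B = N·B_env interaction steps is collected online and advantages are computed from it, so there are no static splits. A sketch of one standard way to compute those advantages (Generalized Advantage Estimation, the usual choice in PPO implementations) is below; the paper's exact estimator and tensor layout are assumptions here.

```python
import torch

def gae_advantages(
    rewards: torch.Tensor,     # (T, N_env) rewards from the rollout
    values: torch.Tensor,      # (T, N_env) value estimates V(s_t)
    dones: torch.Tensor,       # (T, N_env) flags in {0, 1}; 1 means the episode ended after step t
    last_value: torch.Tensor,  # (N_env,) bootstrap value V(s_T) for the final state
    gamma: float = 0.99,       # discount factor
    lam: float = 0.95,         # GAE lambda
) -> torch.Tensor:
    """Generalized Advantage Estimation over an on-policy rollout of T steps."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros_like(last_value)
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]  # no bootstrapping across episode boundaries
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```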