No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
Authors: Skander Moalla, Andrea Miele, Daniil Pyatko, Razvan Pascanu, Caglar Gulcehre
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we empirically study representation dynamics in Proximal Policy Optimization (PPO) on the Atari and MuJoCo environments, revealing that PPO agents are also affected by feature rank deterioration and capacity loss. |
| Researcher Affiliation | Collaboration | Skander Moalla¹, Andrea Miele¹, Daniil Pyatko¹, Razvan Pascanu², Caglar Gulcehre¹ (¹ CLAIRE, EPFL; ² Google DeepMind) |
| Pseudocode | Yes | We refer to PPO-Clip as PPO and provide a pseudocode in Algorithm 1. |
| Open Source Code | Yes | Code and run histories are available at https://github.com/CLAIRE-Labo/no-representation-no-trust. |
| Open Datasets | Yes | We begin our experiments by training PPO agents on the Arcade Learning Environment (ALE) (Bellemare et al., 2013) for pixel-based observations with discrete actions and on MuJoCo (Todorov et al., 2012) for continuous observations with continuous actions. ... To keep our experiments tractable, we choose the Atari-5 subset recommended by Aitchison et al. (2023) |
| Dataset Splits | No | The paper describes how data is collected in rollouts during training (e.g., 'Collect a batch of interaction steps of size B = N·B_env and compute advantages') and used for training, but it does not specify static training, validation, or test splits as percentages or fixed sample counts, since the data is generated dynamically. |
| Hardware Specification | Yes | The experiments in this project took a total of ~11,300 GPU hours on NVIDIA V100 and A100 GPUs (ALE) and ~25,500 CPU hours (MuJoCo). A run on ALE takes around 10 hours on an A100 and 16 hours on a V100. A run on MuJoCo takes around 5 hours on 6 CPUs. |
| Software Dependencies | No | Our codebase uses TorchRL (Bou et al., 2024) and provides a comprehensive toolbox to study representation dynamics in policy optimization. We also provide modified scripts of CleanRL (Huang et al., 2022b) to replicate the collapse observed in this work and ensure it is not a bug from our novel codebase. |
| Experiment Setup | Yes | We provide a high-level pseudocode for PPO in Algorithm 1 and list all hyperparameters considered in Tables 2 and 3. |
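
The Research Type row summarizes the paper's central observation: PPO agents suffer feature rank deterioration and capacity loss. A common way to quantify such deterioration is the effective rank (srank) of a batch of penultimate-layer features. The sketch below assumes that metric and a plain (n_samples, n_features) activation matrix; it is an illustrative implementation, not code from the paper's repository.

```python
import torch

def effective_rank(features: torch.Tensor, delta: float = 0.01) -> int:
    """Effective rank (srank_delta) of a feature matrix: the smallest k such that
    the top-k singular values account for a (1 - delta) fraction of the total
    singular-value mass. Lower values indicate a more collapsed representation.
    """
    singular_values = torch.linalg.svdvals(features)  # sorted in descending order
    cumulative = torch.cumsum(singular_values, dim=0)
    threshold = (1.0 - delta) * singular_values.sum()
    # Count how many singular values are needed before the cumulative mass
    # reaches the threshold; ranks are 1-indexed, hence the +1.
    return int((cumulative < threshold).sum().item()) + 1

# Toy usage: an almost rank-8 feature matrix should report a small effective rank.
features = torch.randn(512, 8) @ torch.randn(8, 256) + 1e-4 * torch.randn(512, 256)
print(effective_rank(features))
```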
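The Pseudocode and Experiment Setup rows point to the paper's Algorithm 1 for PPO-Clip. As a quick reference, here is a minimal sketch of the clipped surrogate objective that PPO-Clip optimizes; the function name and argument layout are illustrative assumptions, not taken from the paper's codebase.

```python
import torch

def ppo_clip_loss(
    new_logprobs: torch.Tensor,  # log pi_theta(a_t | s_t) under the current policy
    old_logprobs: torch.Tensor,  # log pi_theta_old(a_t | s_t) recorded at collection time
    advantages: torch.Tensor,    # advantage estimates for the collected transitions
    clip_eps: float = 0.2,       # PPO clipping coefficient epsilon
) -> torch.Tensor:
    """Clipped surrogate policy loss of PPO-Clip, returned as a quantity to minimize."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum of the two terms; negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```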
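The Dataset Splits row quotes the rollout step of Algorithm 1: a batch of B = N·B_env interaction steps is collected online and advantages are computed from it, so there are no static splits. A sketch of one standard way to compute those advantages (Generalized Advantage Estimation, the usual choice in PPO implementations) is below; the paper's exact estimator and tensor layout are assumptions here.

```python
import torch

def gae_advantages(
    rewards: torch.Tensor,     # (T, N_env) rewards from the rollout
    values: torch.Tensor,      # (T, N_env) value estimates V(s_t)
    dones: torch.Tensor,       # (T, N_env) flags in {0, 1}; 1 means the episode ended after step t
    last_value: torch.Tensor,  # (N_env,) bootstrap value V(s_T) for the final state
    gamma: float = 0.99,       # discount factor
    lam: float = 0.95,         # GAE lambda
) -> torch.Tensor:
    """Generalized Advantage Estimation over an on-policy rollout of T steps."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros_like(last_value)
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]  # no bootstrapping across episode boundaries
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```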