CROP: Towards Distributional-Shift Robust Reinforcement Learning Using Compact Reshaped Observation Processing
Authors: Philipp Altmann, Fabian Ritz, Leonard Feuchtinger, Jonas Nüßlein, Claudia Linnhoff-Popien, Thomy Phan
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show the improvements of CROP in a distributionally shifted safety gridworld. We furthermore provide benchmark comparisons to full observability and data-augmentation in two different-sized procedurally generated mazes. |
| Researcher Affiliation | Academia | Philipp Altmann, Fabian Ritz, Leonard Feuchtinger, Jonas Nüßlein, Claudia Linnhoff-Popien and Thomy Phan, LMU Munich, philipp.altmann@ifi.lmu.de |
| Pseudocode | Yes | Algorithm 1 CROPed Policy Optimization (the listing itself is not reproduced in this report; a structural sketch follows the table) |
| Open Source Code | Yes | All implementations for the following evaluations can be found here: https://github.com/philippaltmann/CROP |
| Open Datasets | Yes | To provide proof-of-concept for CROP we used two holey safety gridworlds inspired by [Leike et al., 2017]... For further evaluation and comparisons in section 7 we use (7, 7)- and (11, 11)-sized generated mazes inspired by [Cobbe et al., 2020] |
| Dataset Splits | No | The paper describes training and testing in different environment configurations (e.g., training in one gridworld, testing in a shifted one; using a pool of random mazes for training/testing), but does not specify explicit train/validation/test percentage splits or sample counts for a single dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | We furthermore built upon the implementations by [Raffin et al., 2021], extending upon [Brockman et al., 2016]. The paper mentions Stable-Baselines3 and OpenAI Gym but does not provide specific version numbers for these or other software dependencies. A hedged usage sketch follows the table. |
| Experiment Setup | Yes | For training PPO, we adopted the default parameters suggested by [Schulman et al., 2017; Raffin et al., 2021]. For Radius CROP we set the radius ρ = (2, 2), resulting in an observation shape of dim(s_t) = ρ · 2 + 1 = (5, 5), padded with wall fields. Given the four possible actions A = {Up, Right, Down, Left}, we parameterized Action CROP with µ = [(−1, 0), (0, 1), (1, 0), (0, −1)], resulting in an observation shape of dim(s_t) = \|A\| = (4). Regarding Object CROP, we chose η = 1 for all safety environments and η = 2 for all mazes, and the set of objects to be detected to be all possible objects excluding the agent itself: O = F \ {Agent}, resulting in O = {Wall, Field, Hole, Goal} and the observation shape dim(s_t) = (4, 2) for the train and test environments (cf. Figure 2a and Figure 2b), as well as O = {Wall, Field, Goal} and the observation shape dim(s_t) = (3, 2) for all maze environments (cf. Figure 2c and Figure 2d). All choices above were determined in preliminary experiments, omitted in this work due to limited space. Sketches of the three crop variants follow the table. |
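
Since the Pseudocode row only names Algorithm 1 without reproducing its listing, here is a minimal structural sketch of how a CROPed policy-optimization loop can be assembled: a crop function is applied to every observation before it reaches an otherwise unchanged learner. The wrapper class and its arguments are assumptions for illustration, not the authors' implementation:

```python
import gym


class CROPWrapper(gym.ObservationWrapper):
    """Applies a compact reshaped observation processing (CROP) function
    to every observation before it reaches the policy-optimization loop."""

    def __init__(self, env, crop_fn, crop_space):
        super().__init__(env)
        self.crop_fn = crop_fn               # e.g. a radius, action, or object crop
        self.observation_space = crop_space  # the reshaped observation space

    def observation(self, obs):
        # Replace the full observation with its compact reshaped view.
        return self.crop_fn(obs)
```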
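
To make the parameterization quoted under Experiment Setup concrete, the following sketches the three crop variants on a gridworld encoded as a 2-D array of integer object IDs. The encoding (`WALL`), the handling of absent objects, and the omission of the paper's η parameter are all assumptions for illustration:

```python
import numpy as np

WALL = 0  # assumed integer encoding of wall fields


def radius_crop(grid, agent_pos, rho=(2, 2)):
    """Radius CROP: window of radius rho around the agent, padded with
    wall fields; with rho = (2, 2), dim(s_t) = rho * 2 + 1 = (5, 5)."""
    padded = np.pad(grid, ((rho[0], rho[0]), (rho[1], rho[1])),
                    constant_values=WALL)
    r, c = agent_pos[0] + rho[0], agent_pos[1] + rho[1]
    return padded[r - rho[0]:r + rho[0] + 1, c - rho[1]:c + rho[1] + 1]


def action_crop(grid, agent_pos,
                mu=((-1, 0), (0, 1), (1, 0), (0, -1))):  # Up, Right, Down, Left
    """Action CROP: the object at each action-adjacent cell, dim(s_t) = (4,)."""
    h, w = grid.shape
    fields = []
    for dr, dc in mu:
        r, c = agent_pos[0] + dr, agent_pos[1] + dc
        fields.append(grid[r, c] if 0 <= r < h and 0 <= c < w else WALL)
    return np.array(fields)


def object_crop(grid, agent_pos, objects):
    """Object CROP: relative position of the nearest instance of each object
    type in O, dim(s_t) = (|O|, 2). The paper's eta is not modeled here."""
    rows = []
    for obj in objects:
        coords = np.argwhere(grid == obj) - np.asarray(agent_pos)
        if len(coords) == 0:
            rows.append(np.zeros(2, dtype=int))  # sentinel for absent objects (assumption)
        else:
            rows.append(coords[np.abs(coords).sum(axis=1).argmin()])
    return np.stack(rows)
```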
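
Because the paper builds on Stable-Baselines3 [Raffin et al., 2021] and OpenAI Gym [Brockman et al., 2016] without pinning versions, the following usage sketch assumes recent releases of both and reuses `CROPWrapper` and `radius_crop` from the sketches above; `make_maze_env` and the `AGENT` encoding are hypothetical stand-ins for the paper's environments:

```python
import numpy as np
from gym.spaces import Box
from stable_baselines3 import PPO

AGENT = 4  # assumed integer encoding of the agent field


def crop(obs):
    # Locate the agent in the raw grid observation, then apply Radius CROP.
    agent_pos = np.argwhere(obs == AGENT)[0]
    return radius_crop(obs, agent_pos, rho=(2, 2))


env = make_maze_env(size=(7, 7))               # hypothetical environment constructor
crop_space = Box(low=0, high=4, shape=(5, 5))  # assumed encoding of the (5, 5) window
model = PPO("MlpPolicy", CROPWrapper(env, crop, crop_space), verbose=1)
model.learn(total_timesteps=1_000_000)         # budget is an assumption, not quoted above
```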