Robust Situational Reinforcement Learning in Face of Context Disturbances
Authors: Jinpeng Zhang, Yufeng Zheng, Chuheng Zhang, Li Zhao, Lei Song, Yuan Zhou, Jiang Bian
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on several locomotion tasks with dynamic contexts and inventory control tasks to demonstrate that our algorithm can generalize better and be more robust against context disturbances, and outperform existing basic RL algorithms that do not consider robustness and robust RL algorithms that consider robustness over the whole state transitions. |
| Researcher Affiliation | Collaboration | Jinpeng Zhang *1 Yufeng Zheng 2 Chuheng Zhang 3 Li Zhao 3 Lei Song 3 Yuan Zhou 4 Jiang Bian 3 ... This work is conducted at Microsoft. 1Department of Mathematical Sciences, Tsinghua University 2Rotman business school, University of Toronto 3Microsoft Research Asia 4Yau Mathematical Sciences Center and Department of Mathematical Sciences, Tsinghua University. Correspondence to: Li Zhao <lizo@microsoft.com>. |
| Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., a specific repository link or an explicit statement of code release) for an implementation of its methodology. |
| Open Datasets | No | The paper mentions modifying 'standard MuJoCo (Todorov et al., 2012) tasks' and using 'historical data of customer demands from 50 Stock Keeping Units (SKUs)' for training, but does not provide specific access information (e.g., a link, DOI, or a citation to the specific customer demand dataset) for the datasets used in its experiments. |
| Dataset Splits | No | The paper states '50 Stock Keeping Units (SKUs) are used to build the training simulators, and fixed sequences of customer demands from other 5 SKUs serve as target domains to test RL policies' which indicates a split for source and target domains, but it does not specify a standard train/validation/test split with percentages or sample counts for the datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like 'MuJoCo' and various SAC-based algorithms, but does not specify exact version numbers for any software dependencies (e.g., 'Python 3.x', 'PyTorch x.x'). |
| Experiment Setup | Yes | "Table 1. Specific hyperparameters for RS-SAC"; "Table 2. Shared hyperparameters for all algorithms". These tables include values for parameters such as 'β', 'τ', 'noise clip (c)', 'noise samples (K)', 'number of hidden layers', 'number of units per layer', 'activation', 'optimizer', 'discount factor', 'learning rate', 'replay buffer size', 'batch size', 'target entropy', 'soft update coefficient', and 'soft update interval'. |
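
Beyond the audit table, the paper's abstract names its core mechanism: a softmin smoothed robust Bellman operator that approximates the worst-case value over perturbed context transitions, with RS-SAC hyperparameters β, noise clip (c), and noise samples (K) listed in the Experiment Setup row above. Since no code is released (see the Open Source Code row), the following is only a minimal sketch of such a backup, assuming a Gaussian context-noise model with uniform sample weighting; the names `softmin`, `robust_q_target`, `q_fn`, and `noise_scale` are our hypothetical choices, not the authors' implementation.

```python
import numpy as np

def softmin(values, beta):
    """Smooth approximation of min(values):
    softmin_beta(x) = -(1/beta) * log(mean(exp(-beta * x))).
    As beta -> infinity this recovers the hard minimum."""
    x = -beta * np.asarray(values, dtype=float)
    m = x.max()  # log-sum-exp shift for numerical stability
    return -(m + np.log(np.mean(np.exp(x - m)))) / beta

def robust_q_target(q_fn, reward, next_state, next_context,
                    gamma=0.99, beta=10.0, noise_clip=0.5,
                    n_noise_samples=10, noise_scale=0.1, rng=None):
    """Softmin-smoothed robust Bellman target for one transition:
    sample K clipped perturbations of the next context, evaluate the
    bootstrap value under each, and aggregate with softmin rather
    than a hard (non-smooth) minimum."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.normal(0.0, noise_scale,
                     size=(n_noise_samples,) + np.shape(next_context))
    eps = np.clip(eps, -noise_clip, noise_clip)
    q_vals = [q_fn(next_state, next_context + e) for e in eps]
    return reward + gamma * softmin(q_vals, beta)

# Toy check with a quadratic stand-in for a learned Q-network.
q_fn = lambda s, c: -float(np.sum((np.asarray(s) - np.asarray(c)) ** 2))
print(robust_q_target(q_fn, reward=1.0,
                      next_state=np.array([0.2, 0.1]),
                      next_context=np.zeros(2)))
```

The log-sum-exp shift keeps the softmin numerically stable; as β grows the target approaches the hard worst case over the sampled context perturbations, while smaller β yields a smoother, less conservative objective that is easier to optimize inside a SAC-style critic update.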