Multi-Critic Actor Learning: Teaching RL Policies to Act with Style

Authors: Siddharth Mysore, George Cheng, Yunqi Zhao, Kate Saenko, Meng Wu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed MultiCriticAL method in a series of multi-style RL problems. We first test our method on two basic multi-style learning environments... MultiCriticAL is tested in the context of multi-style learning... and yields up to 56% performance gains over the single-critic baselines and even successfully learns behavior styles in cases where single-critic approaches may simply fail to learn.
Researcher Affiliation | Collaboration | Siddharth Mysore (Boston University & Electronic Arts); George Cheng and Yunqi Zhao (Electronic Arts); Kate Saenko (Boston University & MIT-IBM Watson AI Lab); Meng Wu (Electronic Arts)
Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured algorithm blocks.
Open Source Code | Yes | Training code is based on OpenAI's Spinning Up (Achiam, 2018) and is provided in the Supplementary Material for the multi-style environments. The modified code base is included in the Supplementary Material for this paper, along with the code for the custom environments for Path Following (presented in Section 5.1) and Pong with rotatable paddles (presented in Section 5.2).
Open Datasets | Yes | Our implementation is based on the Pong environment provided by the Pygame Learning Environment (PLE) (Tasfi, 2016). Additionally, we test MultiCriticAL on learning to play multiple levels of the Sega Genesis games Sonic the Hedgehog and Sonic the Hedgehog 2, where the agent is trained to play the different levels of each game, following the reward structure devised for the Gym Retro contest (Nichol et al., 2018).
Dataset Splits | No | The paper describes various training configurations and evaluations but does not explicitly provide details about training, validation, or test dataset splits in terms of percentages or sample counts.
Hardware Specification | No | The paper mentions 'hardware parallelism' but does not specify particular hardware components such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper mentions several software components and frameworks, including OpenAI's Spinning Up, the Pygame Learning Environment (PLE), Gym Retro wrappers, and OpenAI Baselines, but it does not provide specific version numbers for any of them.
Experiment Setup | Yes | We train MultiCriticAL with two popular contemporary actor-critic algorithms: PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018)... Hidden layer configurations for both the actors and critics per environment: Path Following: PPO [64,64], SAC [8,8]... Reward design: the reward breakdown for the environment is as follows: standard to all styles: victory reward +1; strike request penalty -0.001... Run configurations: single-style PPO trained for 100 epochs with 4000 steps per epoch and 90 steps per episode...
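
The abstract quoted in the Research Type row describes the core mechanism: a single shared actor is trained against a separate critic for each behavior style, so that value estimates for differently rewarded styles do not interfere through one critic. Below is a minimal PyTorch sketch of that arrangement, written in the spirit of the Spinning Up code the paper builds on; the class and helper names (MultiCriticActorCritic, mlp, one-hot style conditioning) are illustrative assumptions, not the authors' released implementation.

    # Illustrative sketch only (not the authors' released code): one shared actor
    # with a separate critic per behavior style.
    import torch
    import torch.nn as nn

    def mlp(sizes, activation=nn.Tanh):
        # Plain fully connected network, similar in spirit to Spinning Up's mlp helper.
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:
                layers.append(activation())
        return nn.Sequential(*layers)

    class MultiCriticActorCritic(nn.Module):
        def __init__(self, obs_dim, act_dim, n_styles, hidden=(64, 64)):
            super().__init__()
            self.n_styles = n_styles
            # A single actor shared across styles; the active style is appended
            # to the observation as a one-hot vector.
            self.actor = mlp([obs_dim + n_styles, *hidden, act_dim])
            # One critic per style, each scoring state-action pairs under its
            # own style-specific reward.
            self.critics = nn.ModuleList(
                [mlp([obs_dim + act_dim, *hidden, 1]) for _ in range(n_styles)]
            )

        def act(self, obs, style_id):
            style = torch.zeros(obs.shape[0], self.n_styles)
            style[:, style_id] = 1.0
            return torch.tanh(self.actor(torch.cat([obs, style], dim=-1)))

        def actor_loss(self, obs, style_id):
            # Only the critic matching the active style grades the action, so
            # gradients for different styles never pass through the same critic.
            action = self.act(obs, style_id)
            q = self.critics[style_id](torch.cat([obs, action], dim=-1))
            return -q.mean()

    # Example: three styles, 8-dimensional observations, 2-dimensional actions.
    ac = MultiCriticActorCritic(obs_dim=8, act_dim=2, n_styles=3)
    loss = ac.actor_loss(torch.randn(16, 8), style_id=1)
    loss.backward()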
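
The Open Datasets row points to two public environment suites, the Pygame Learning Environment and Gym Retro. The snippet below is only a rough sketch of how those suites are typically instantiated, not the paper's actual wrappers (the custom rotatable-paddle Pong and the retro wrappers live in the Supplementary Material); the Sonic level name is a placeholder, and the Genesis ROMs must be imported separately.

    # Rough environment-setup sketch; the paper's own wrappers are not reproduced here.
    from ple import PLE
    from ple.games.pong import Pong
    import retro

    # Pygame Learning Environment Pong (Tasfi, 2016).
    pong = PLE(Pong(), fps=30, display_screen=False)
    pong.init()
    state = pong.getGameState()                 # paddle and ball positions/velocities
    reward = pong.act(pong.getActionSet()[0])   # step the game with one of its actions

    # Gym Retro Sonic the Hedgehog (Nichol et al., 2018); the level name is a placeholder.
    env = retro.make(game="SonicTheHedgehog-Genesis", state="GreenHillZone.Act1")
    obs = env.reset()
    obs, rew, done, info = env.step(env.action_space.sample())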
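
The Experiment Setup row quotes concrete numbers for the single-style PPO baseline (100 epochs, 4000 steps per epoch, 90-step episodes) and the [64,64] hidden layers used for Path Following with PPO. A minimal sketch of how such a run could be launched with stock Spinning Up is shown below; the environment is a placeholder, since the paper's custom Path Following and Pong environments ship only in its Supplementary Material, and the multi-critic modifications themselves are not part of stock Spinning Up.

    # Hypothetical single-style PPO baseline launch with unmodified Spinning Up.
    import gym
    from spinup import ppo_pytorch as ppo

    env_fn = lambda: gym.make("Pendulum-v1")    # placeholder for the paper's custom envs

    ppo(
        env_fn,
        ac_kwargs=dict(hidden_sizes=(64, 64)),  # Path Following PPO networks: [64, 64]
        steps_per_epoch=4000,                   # 4000 environment steps per epoch
        epochs=100,                             # 100 training epochs
        max_ep_len=90,                          # 90 steps per episode
    )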