Discovering a set of policies for the worst case reward
Authors: Tom Zahavy, Andre Barreto, Daniel J Mankowitz, Shaobo Hou, Brendan O'Donoghue, Iurii Kemaev, Satinder Singh
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of our algorithm. |
| Researcher Affiliation | Industry | Tom Zahavy, Andre Barreto, Daniel J Mankowitz, Shaobo Hou, Brendan O'Donoghue, Iurii Kemaev and Satinder Singh, DeepMind |
| Pseudocode | Yes | Algorithm 1: SMP worst case policy iteration |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the described methodology or a direct link to a source-code repository. |
| Open Datasets | Yes | We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. Next, we conducted a set of experiments in the DM Control Suite (Tassa et al., 2018). |
| Dataset Splits | No | The paper describes training policies and evaluating them, but it does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions specific software components like "STACX (Zahavy et al., 2020d)" and "CVXPY (Diamond & Boyd, 2016)" but does not provide version numbers for them. |
| Experiment Setup | Yes | At each iteration (x-axis) of Algorithm 1 we train a policy for 5 × 10^5 steps to maximize w^SMP_{Π_t}. We then compute the SFs of that policy using an additional 5 × 10^5 steps and evaluate it w.r.t. w^SMP_{Π_t}. At each iteration of Algorithm 1 we train a policy for 2 × 10^6 steps using an actor-critic (specifically STACX (Zahavy et al., 2020d)) to maximize w^SMP_{Π_t}, add it to the set, and compute a new w^SMP_{Π_{t+1}}. We focused on the setup where the agent is learning from feature observations corresponding to the positions and velocities of the body in the task (pixels were only used for visualization). (A hedged code sketch of this iteration appears below the table.) |
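
The Experiment Setup row describes the loop of Algorithm 1: train a policy against the current worst-case reward weight w^SMP_{Π_t}, estimate its successor features (SFs), add the policy to the set, and recompute the adversarial weight for the enlarged set. The sketch below is a minimal illustration of that loop, not the authors' implementation: it assumes SFs are plain numpy vectors, takes the adversarial weight to range over the unit L2 ball, and solves the inner minimization with CVXPY (which the paper cites). `train_policy_to_maximize` and `estimate_sfs` are hypothetical placeholders for the STACX actor-critic training and SF-estimation runs quoted above.

```python
# Minimal sketch of an SMP worst-case policy iteration loop (not the authors'
# code). Assumptions: successor features (SFs) are d-dimensional numpy vectors
# and the adversarial reward weight w is constrained to the unit L2 ball.
import numpy as np
import cvxpy as cp


def worst_case_weight(sfs):
    """Solve min_w max_i w . psi_i subject to ||w||_2 <= 1 (epigraph form)."""
    d = sfs[0].shape[0]
    w = cp.Variable(d)
    t = cp.Variable()
    constraints = [w @ psi <= t for psi in sfs]
    constraints.append(cp.norm(w, 2) <= 1)
    cp.Problem(cp.Minimize(t), constraints).solve()
    return np.asarray(w.value)


def smp_worst_case_policy_iteration(train_policy_to_maximize, estimate_sfs,
                                    w_init, num_iterations):
    """Hypothetical driver: the two callables stand in for the 2 x 10^6-step
    actor-critic runs and 5 x 10^5-step SF estimation described in the table."""
    policies, sfs = [], []
    w = np.asarray(w_init, dtype=float)
    for _ in range(num_iterations):
        pi = train_policy_to_maximize(w)   # maximize reward r(s, a) = w . phi(s, a)
        psi = estimate_sfs(pi)             # expected discounted feature sums under pi
        policies.append(pi)
        sfs.append(psi)
        w = worst_case_weight(sfs)         # adversarial reward for the enlarged set
    return policies, w
```

The design point the loop illustrates: each new policy is trained as a best response to the current adversarial reward, which is what underlies the monotonically improving worst-case performance noted in the Research Type row.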