Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures
Authors: Jonathan Uesato*, Ananya Kumar*, Csaba Szepesvari*, Tom Erez, Avraham Ruderman, Keith Anderson, Krishnamurthy (Dj) Dvijotham, Nicolas Heess, Pushmeet Kohli
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of adversarial evaluation on two standard domains: humanoid control and simulated driving. Experimental results show that our methods can find catastrophic failures and estimate failure rates of agents multiple orders of magnitude faster than standard evaluation schemes, in minutes to hours rather than days. (See the importance-sampling sketch after the table for a generic illustration of such an estimator.) |
| Researcher Affiliation | Industry | Jonathan Uesato, Ananya Kumar, Csaba Szepesvari, Tom Erez, Avraham Ruderman, Keith Anderson, Krishnamurthy (Dj) Dvijotham, Nicolas Heess, Pushmeet Kohli. DeepMind, London, UK. {juesato, ananyak, szepi, pushmeet}@google.com |
| Pseudocode | Yes | The pseudocode is included in Appendix C as Algorithm 2 (AVF Adversary). ... The pseudocode of the full procedure is given as Algorithm 1. |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology or a link to a code repository. |
| Open Datasets | No | The paper uses the TORCS simulator (Wymann et al., 2000) and MuJoCo simulator (Todorov et al., 2012; Tassa et al., 2018) as environments, and describes how data (e.g., initial conditions, trajectories) are sampled or generated within these environments. However, it does not provide concrete access information (link, DOI, repository, or explicit citation to a specific dataset) for a publicly available, pre-existing dataset that was used for training or evaluation. |
| Dataset Splits | No | The paper mentions selecting hyperparameters for the AVF model 'based on a held-out test set using data collected during training', which functions as a validation step for the AVF model. However, it does not specify explicit training/validation/test splits for the main agent evaluation or the data generated from the environments. |
| Hardware Specification | No | The paper mentions using '100 CPU workers and a single GPU learner' and '32 CPU workers and a single GPU learner' for training, and 'a single GPU' for AVF model training. However, it does not provide specific hardware details such as CPU or GPU models, memory specifications, or cloud instance types. |
| Software Dependencies | No | The paper references TORCS and MuJoCo simulators and mentions using the Adam optimizer, but it does not provide specific version numbers for any software dependencies, programming languages, or libraries used in their implementation. |
| Experiment Setup | Yes | The paper provides extensive details on the experimental setup, including agent types (actor-critic, D4PG), training steps (1e9 actor steps, 4e6 learner steps), population size (5), exploration rates, use of demonstrations (1,000 trajectories), data handling for AVF models (last 150,000/200,000 episodes), AVF training iterations (20,000/40,000), and AVF architectures (a 4-layer MLP with 32 hidden units, and a DND variant with K=32 neighbors, a 1-layer MLP embedding to 16 dimensions, and a Gaussian kernel; see the sketch after the table). |
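
To make the reported AVF architecture concrete, here is a minimal sketch, assuming PyTorch, four linear layers with 32 hidden units, and a sigmoid output over a flat initial-condition vector. The framework, input dimensionality (`input_dim`), input encoding, and the exact depth convention of "4-layer MLP" are not specified in the paper and are assumptions here.

```python
import torch
import torch.nn as nn

class AVF(nn.Module):
    """Failure-probability predictor (a sketch, not the paper's code)."""

    def __init__(self, input_dim: int = 64):  # input_dim is hypothetical
        super().__init__()
        # "4-layer MLP, 32 hidden units": read here as four Linear layers
        # with 32-unit hidden widths; the exact depth convention is assumed.
        self.net = nn.Sequential(
            nn.Linear(input_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 1),  # logit of the failure probability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)

# Usage: score candidate initial conditions and evaluate the agent on those
# the AVF predicts are most likely to end in catastrophic failure.
avf = AVF()
candidates = torch.randn(1000, 64)           # hypothetical initial conditions
scores = avf(candidates)                     # predicted failure probabilities
worst = candidates[scores.topk(10).indices]  # prioritize these rollouts
```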
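
The speedup claimed in the Research Type row comes from steering evaluation toward likely failures rather than sampling initial conditions uniformly. As a generic illustration of the underlying statistics, and not a reproduction of the paper's estimator, here is a minimal importance-sampling sketch: conditions are drawn from a proposal biased toward predicted failures, and each observed failure is reweighted by the likelihood ratio so the failure-rate estimate stays unbiased. Every function, name, and distribution below is an illustrative assumption.

```python
import math
import random

def failure_rate_is(sample_proposal, p_density, q_density, agent_fails, n=10000):
    """Unbiased failure-rate estimate when sampling from proposal q, not p.

    E_p[1{fail}] = E_q[1{fail} * p(x)/q(x)], so each observed failure is
    reweighted by the likelihood ratio. All arguments are hypothetical hooks.
    """
    total = 0.0
    for _ in range(n):
        x = sample_proposal()                     # x ~ q, biased toward failures
        if agent_fails(x):
            total += p_density(x) / q_density(x)  # importance weight
    return total / n

# Toy usage: the "agent" fails when x > 3, which is rare under the true
# N(0, 1) initial-condition distribution; the proposal N(3, 1) targets
# that region, so far fewer rollouts are needed for a stable estimate.
normal = lambda x, mu: math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)
est = failure_rate_is(
    sample_proposal=lambda: random.gauss(3, 1),
    p_density=lambda x: normal(x, 0.0),
    q_density=lambda x: normal(x, 3.0),
    agent_fails=lambda x: x > 3,
)
print(f"estimated failure rate: {est:.5f}")  # ~0.00135 = P(N(0,1) > 3)
```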