The Impact of Task Underspecification in Evaluating Deep Reinforcement Learning
Authors: Vindula Jayawardana, Catherine Tang, Sirui Li, Dajiang Suo, Cathy Wu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that in comparison to evaluating DRL methods on select MDP instances, evaluating the MDP family often yields a substantially different relative ranking of methods, casting doubt on what methods should be considered state-of-the-art. We validate this phenomenon in standard control benchmarks and the real-world application of traffic signal control. At the same time, we show that accurately evaluating on an MDP family is nontrivial. |
| Researcher Affiliation | Academia | Vindula Jayawardana (MIT, vindula@mit.edu); Catherine Tang (MIT, cattang@mit.edu); Sirui Li (MIT, siruil@mit.edu); Dajiang Suo (MIT, djsuo@mit.edu); Cathy Wu (MIT, cathywu@mit.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a direct link to its source code or explicitly state that the code is publicly available. |
| Open Datasets | Yes | We use three popular control tasks (Pendulum, Quad [23], and Swimmer) as example underspecified tasks. ... In particular, we consider the evaluations of DRL methods on the traffic signal control task, leveraging the RESCO benchmark [4]. ... The importance scores and the intersections used to build the point MDP distribution of the intersections are taken from Salt Lake City in Utah. |
| Dataset Splits | No | The paper discusses training and evaluation but does not specify explicit train/validation/test dataset splits, percentages, or sample counts. |
| Hardware Specification | No | The authors acknowledge MIT Super Cloud and the Lincoln Laboratory Supercomputing Center for providing computational resources supporting the research results in this paper. (This mentions general computing facilities but lacks specific hardware details like GPU/CPU models.) |
| Software Dependencies | No | The paper mentions DRL algorithms like PPO [38], TRPO [37], and TD3 [19] but does not provide specific version numbers for these implementations or any underlying software libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We fix the number of total training steps to 5M for swimmer and 2M for quad and pendulum. |
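
The Research Type entry above contrasts evaluating on a select MDP instance with evaluating on an MDP family. The following is a minimal Python sketch of how the two protocols can produce different rankings. The method names, the return values, and the uniform averaging over instances are all hypothetical assumptions made for illustration; they are not the paper's data or its evaluation procedure.

```python
from statistics import mean


def rank_methods(scores):
    """Return method names sorted from best to worst (higher score is better)."""
    return sorted(scores, key=scores.get, reverse=True)


def family_scores(returns_per_instance):
    """Average each method's return over all instances (uniform weights assumed)."""
    return {method: mean(rets) for method, rets in returns_per_instance.items()}


# Hypothetical returns of two methods on three instances of one task family.
# Index 0 stands in for the single "point" instance a benchmark might report.
returns = {
    "method_A": [100.0, 40.0, 30.0],
    "method_B": [90.0, 80.0, 85.0],
}

# Point-MDP evaluation: look only at the first instance.
point_ranking = rank_methods({m: r[0] for m, r in returns.items()})

# Family evaluation: average over every instance.
family_ranking = rank_methods(family_scores(returns))

print("point ranking: ", point_ranking)
print("family ranking:", family_ranking)
```

With these made-up numbers, the method that leads on the single instance falls behind once all instances are averaged, which is the kind of ranking change the table entry describes.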