The Impact of Task Underspecification in Evaluating Deep Reinforcement Learning

Authors: Vindula Jayawardana, Catherine Tang, Sirui Li, Dajiang Suo, Cathy Wu

NeurIPS 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We show that in comparison to evaluating DRL methods on select MDP instances, evaluating the MDP family often yields a substantially different relative ranking of methods, casting doubt on what methods should be considered state-of-the-art. We validate this phenomenon in standard control benchmarks and the real-world application of traffic signal control. At the same time, we show that accurately evaluating on an MDP family is nontrivial. |
| Researcher Affiliation | Academia | Vindula Jayawardana (MIT, vindula@mit.edu), Catherine Tang (MIT, cattang@mit.edu), Sirui Li (MIT, siruil@mit.edu), Dajiang Suo (MIT, djsuo@mit.edu), Cathy Wu (MIT, cathywu@mit.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a direct link to its source code or explicitly state that the code is publicly available. |
| Open Datasets | Yes | We use three popular control tasks (Pendulum, Quad [23], and Swimmer) as example underspecified tasks. ... In particular, we consider the evaluations of DRL methods on the traffic signal control task, leveraging the RESCO benchmark [4]. ... The importance scores and the intersections used to build the point MDP distribution of the intersections are taken from Salt Lake City in Utah. |
| Dataset Splits | No | The paper discusses training and evaluation but does not specify explicit train/validation/test dataset splits, percentages, or sample counts. |
| Hardware Specification | No | The authors acknowledge MIT SuperCloud and the Lincoln Laboratory Supercomputing Center for providing computational resources supporting the research results in this paper. (This mentions general computing facilities but lacks specific hardware details such as GPU/CPU models.) |
| Software Dependencies | No | The paper mentions DRL algorithms such as PPO [38], TRPO [37], and TD3 [19] but does not provide version numbers for these implementations or for any underlying software libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | We fix the number of total training steps to 5M for swimmer and 2M for quad and pendulum. |
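The ranking-flip phenomenon summarized in the Research Type row is easy to illustrate in isolation. Below is a minimal, hypothetical sketch, not the paper's code or environments: the context parameter, the two methods, and their return functions are all invented stand-ins. It shows how a method that wins on a single hand-picked "point" MDP can lose on average over the MDP family the task actually underspecifies.

```python
# Hypothetical sketch of point-MDP vs. MDP-family evaluation.
# All names and return functions below are illustrative assumptions,
# not the experiments from Jayawardana et al. (NeurIPS 2022).
import numpy as np

rng = np.random.default_rng(0)

# An MDP family indexed by a context parameter (e.g., a pendulum length
# drawn from some deployment distribution).
contexts = rng.uniform(0.5, 2.0, size=1000)

def return_method_a(ctx):
    # Hypothetical: method A excels on short pendulums but degrades quickly.
    return 100.0 - 40.0 * ctx

def return_method_b(ctx):
    # Hypothetical: method B peaks lower but is more robust across contexts.
    return 70.0 - 10.0 * ctx

# Evaluation on a single hand-picked benchmark instance.
point_ctx = 0.5
print("Point-MDP ranking:",
      "A > B" if return_method_a(point_ctx) > return_method_b(point_ctx) else "B > A")

# Evaluation averaged over the MDP family.
family_a = np.mean([return_method_a(c) for c in contexts])
family_b = np.mean([return_method_b(c) for c in contexts])
print("MDP-family ranking:",
      "A > B" if family_a > family_b else "B > A")
```

Running this prints `A > B` for the point MDP but `B > A` for the family: point-MDP evaluation implicitly conditions on one context, while family evaluation integrates over the context distribution, which is the gap the paper's reproducibility-relevant claim rests on.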