reproducibilityindex.ai

The Impact of Task Underspecification in Evaluating Deep Reinforcement Learning

Authors: Vindula Jayawardana, Catherine Tang, Sirui Li, Dajiang Suo, Cathy Wu

NeurIPS 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that in comparison to evaluating DRL methods on select MDP instances, evaluating the MDP family often yields a substantially different relative ranking of methods, casting doubt on what methods should be considered state-of-the-art. We validate this phenomenon in standard control benchmarks and the real-world application of traffic signal control. At the same time, we show that accurately evaluating on an MDP family is nontrivial.
Researcher Affiliation	Academia	Vindula Jayawardana MIT vindula@mit.edu Catherine Tang MIT cattang@mit.edu Sirui Li MIT siruil@mit.edu Dajiang Suo MIT djsuo@mit.edu Cathy Wu MIT cathywu@mit.edu
Pseudocode	No	The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code	No	The paper does not provide a direct link to its source code or explicitly state that the code is publicly available.
Open Datasets	Yes	We use three popular control tasks (Pendulum, Quad [23], and Swimmer) as example underspecified tasks. ... In particular, we consider the evaluations of DRL methods on the traffic signal control task, leveraging the RESCO benchmark [4]. ... The importance scores and the intersections used to build the point MDP distribution of the intersections are taken from Salt Lake City in Utah.
Dataset Splits	No	The paper discusses training and evaluation but does not specify explicit train/validation/test dataset splits, percentages, or sample counts.
Hardware Specification	No	The authors acknowledge MIT Super Cloud and the Lincoln Laboratory Supercomputing Center for providing computational resources supporting the research results in this paper. (This mentions general computing facilities but lacks specific hardware details like GPU/CPU models.)
Software Dependencies	No	The paper mentions DRL algorithms like PPO [38], TRPO [37], and TD3 [19] but does not provide specific version numbers for these implementations or any underlying software libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	We fix the number of total training steps to 5M for swimmer and 2M for quad and pendulum.
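
The Research Type entry above says that ranking methods on a single MDP instance can differ from ranking them on the whole MDP family. The following is a minimal sketch of that comparison, not the authors' evaluation code. The method names, the return values, and the use of a uniform mean over the family are all assumptions made for illustration; none of the numbers come from the paper.

```python
from statistics import mean

# Hypothetical returns (NOT from the paper): each method maps to its return
# on three MDP instances of one family, e.g. the same task with different
# physical parameters.
returns = {
    "method_a": [0.9, 0.4, 0.3],
    "method_b": [0.7, 0.6, 0.7],
    "method_c": [0.8, 0.5, 0.2],
}


def rank_on_instance(returns, idx):
    """Rank methods (best first) by their return on a single MDP instance."""
    return sorted(returns, key=lambda m: returns[m][idx], reverse=True)


def rank_on_family(returns):
    """Rank methods (best first) by mean return over all instances.

    A uniform mean is an assumption; a family could also be weighted by a
    distribution over its instances.
    """
    return sorted(returns, key=lambda m: mean(returns[m]), reverse=True)


point_ranking = rank_on_instance(returns, 0)
family_ranking = rank_on_family(returns)
print("single instance:", point_ranking)
print("whole family:   ", family_ranking)
```

With these made-up numbers, the method that looks best on the first instance is not the one that is best on average over the family, which is the kind of ranking change the entry describes.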