Efficient Deep Reinforcement Learning Requires Regulating Overfitting

Authors: Qiyang Li, Aviral Kumar, Ilya Kostrikov, Sergey Levine

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform thorough empirical analysis on state-based DeepMind Control Suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms, and prior methods that lead to good performance do, in fact, control the validation TD error to be low. (The validation TD error computation is sketched below the table.)
Researcher Affiliation | Academia | Qiyang Li, Aviral Kumar, Ilya Kostrikov, Sergey Levine, UC Berkeley, {qcli,aviralk,kostrikov,svlevine}@berkeley.edu
Pseudocode | Yes | Algorithm 1: AVTD (a simplified sketch appears below the table)
Open Source Code | Yes | For the SAC implementation used in this paper, we build our code on top of the jaxrl codebase: https://github.com/ikostrikov/jaxrl (Kostrikov, 2021).
Open Datasets | Yes | We perform thorough empirical analysis on state-based DeepMind Control Suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms, and prior methods that lead to good performance do, in fact, control the validation TD error to be low.
Dataset Splits | Yes | After every 10 episodes, collect a heldout trajectory and add to D_heldout with the same action selection strategy above for D. (See the data-collection sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments.
Software Dependencies | No | The paper mentions building code on the 'jaxrl codebase' and using 'AdamW' and 'ReLU activation', but it does not specify versions for any key software components or libraries.
Experiment Setup | Yes | Initial Temperature 1.0; Target Update Rate (update rate of target networks) 0.005; Learning Rate (Adam optimizer) 0.0003; Discount Factor 0.99; Batch Size 256; Network Size (256, 256); Warmup Period (# of initial random exploration steps) 10000 for DMC, 5000 for gym MuJoCo. (These values are collected into a config sketch below the table.)
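
The central diagnostic quoted in the Research Type and Open Datasets rows is the TD error measured on a held-out set of transitions. A minimal sketch of that computation follows; the q_fn, policy_fn, and batch names are illustrative placeholders rather than identifiers from the paper or the jaxrl codebase, and SAC's entropy term is omitted for brevity.

    import jax.numpy as jnp

    def validation_td_error(q_params, target_q_params, q_fn, policy_fn, batch,
                            discount=0.99):
        """Mean squared TD error on a batch of held-out transitions from D_heldout."""
        # Bootstrap with the target network and an action sampled from the current policy.
        next_actions = policy_fn(batch["next_observations"])
        target_q = q_fn(target_q_params, batch["next_observations"], next_actions)
        td_target = batch["rewards"] + discount * (1.0 - batch["dones"]) * target_q
        q = q_fn(q_params, batch["observations"], batch["actions"])
        return jnp.mean((q - td_target) ** 2)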
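
The Pseudocode row cites Algorithm 1 (AVTD), which trains several agents with different regularization hyperparameters on a shared replay buffer and lets the agent with the lowest validation TD error act in the environment. The loop below is a simplified sketch under an assumed agent/buffer interface (update, act, validation_td_error, sample, add); the paper's Algorithm 1 specifies the exact procedure and selection schedule.

    def avtd_interaction_step(agents, env, obs, replay_buffer, heldout_batch):
        """One environment step of a simplified AVTD-style selection loop."""
        # Every candidate agent trains on the same shared replay data.
        batch = replay_buffer.sample()
        for agent in agents:
            agent.update(batch)
        # Act with the agent whose TD error on held-out transitions is currently lowest.
        errors = [agent.validation_td_error(heldout_batch) for agent in agents]
        active_agent = agents[errors.index(min(errors))]
        action = active_agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        return next_obs, done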
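
The Dataset Splits row describes how the held-out set is built: after every 10 training episodes, an extra trajectory is collected with the same action-selection strategy and stored in D_heldout. A minimal sketch, assuming a Gym-style environment with the classic reset/step API and list-like buffers:

    def collect_episode(env, agent):
        """Roll out one episode and return its transitions."""
        transitions, obs, done = [], env.reset(), False
        while not done:
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            transitions.append((obs, action, reward, next_obs, done))
            obs = next_obs
        return transitions

    def collect_with_heldout_split(env, agent, D, D_heldout, num_episodes):
        for episode in range(1, num_episodes + 1):
            D.extend(collect_episode(env, agent))            # training buffer
            if episode % 10 == 0:
                # Extra trajectory, same action-selection strategy, used only for
                # measuring validation TD error.
                D_heldout.extend(collect_episode(env, agent))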
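
For reference, the hyperparameters from the Experiment Setup row gathered into one config mapping. The key names are illustrative shorthand; only the values come from the row above.

    SAC_HYPERPARAMS = {
        "initial_temperature": 1.0,
        "target_update_rate": 0.005,        # Polyak rate for target networks
        "learning_rate": 3e-4,              # Adam optimizer
        "discount": 0.99,
        "batch_size": 256,
        "hidden_dims": (256, 256),
        "warmup_steps": {"dmc": 10_000, "gym_mujoco": 5_000},  # initial random exploration
    }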