Efficient Deep Reinforcement Learning Requires Regulating Overfitting

Authors: Qiyang Li, Aviral Kumar, Ilya Kostrikov, Sergey Levine

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform thorough empirical analysis on state-based DeepMind Control Suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms, and prior methods that lead to good performance do, in fact, control the validation TD error to be low. (The validation TD error computation is sketched below the table.)
Researcher Affiliation | Academia | Qiyang Li, Aviral Kumar, Ilya Kostrikov, Sergey Levine, UC Berkeley, {qcli,aviralk,kostrikov,svlevine}@berkeley.edu
Pseudocode | Yes | Algorithm 1: AVTD (a simplified sketch appears below the table)
Open Source Code | Yes | For the SAC implementation used in this paper, we build our code on top of the jaxrl codebase: https://github.com/ikostrikov/jaxrl (Kostrikov, 2021).
Open Datasets | Yes | We perform thorough empirical analysis on state-based DeepMind Control Suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms, and prior methods that lead to good performance do, in fact, control the validation TD error to be low.
Dataset Splits | Yes | After every 10 episodes, collect a heldout trajectory and add to D_heldout with the same action selection strategy above for D. (See the data-collection sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments.
Software Dependencies | No | The paper mentions building code on the 'jaxrl codebase' and using 'AdamW' and 'ReLU activation', but it does not specify versions for any key software components or libraries.
Experiment Setup | Yes | Initial Temperature 1.0; Target Update Rate (update rate of target networks) 0.005; Learning Rate (Adam optimizer) 0.0003; Discount Factor 0.99; Batch Size 256; Network Size (256, 256); Warmup Period (# of initial random exploration steps) 10000 for DMC, 5000 for gym MuJoCo. (These values are collected into a config sketch below the table.)
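
The central diagnostic quoted in the Research Type and Open Datasets rows is the TD error measured on a held-out set of transitions. A minimal sketch of that computation follows; the q_fn, policy_fn, and batch names are illustrative placeholders rather than identifiers from the paper or the jaxrl codebase, and SAC's entropy term is omitted for brevity.

    import jax.numpy as jnp

    def validation_td_error(q_params, target_q_params, q_fn, policy_fn, batch,
                            discount=0.99):
        """Mean squared TD error on a batch of held-out transitions from D_heldout."""
        # Bootstrap with the target network and an action sampled from the current policy.
        next_actions = policy_fn(batch["next_observations"])
        target_q = q_fn(target_q_params, batch["next_observations"], next_actions)
        td_target = batch["rewards"] + discount * (1.0 - batch["dones"]) * target_q
        q = q_fn(q_params, batch["observations"], batch["actions"])
        return jnp.mean((q - td_target) ** 2)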
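
The Pseudocode row cites Algorithm 1 (AVTD), which trains several agents with different regularization hyperparameters on a shared replay buffer and lets the agent with the lowest validation TD error act in the environment. The loop below is a simplified sketch under an assumed agent/buffer interface (update, act, validation_td_error, sample, add); the paper's Algorithm 1 specifies the exact procedure and selection schedule.

    def avtd_interaction_step(agents, env, obs, replay_buffer, heldout_batch):
        """One environment step of a simplified AVTD-style selection loop."""
        # Every candidate agent trains on the same shared replay data.
        batch = replay_buffer.sample()
        for agent in agents:
            agent.update(batch)
        # Act with the agent whose TD error on held-out transitions is currently lowest.
        errors = [agent.validation_td_error(heldout_batch) for agent in agents]
        active_agent = agents[errors.index(min(errors))]
        action = active_agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        return next_obs, done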
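
The Dataset Splits row describes how the held-out set is built: after every 10 training episodes, an extra trajectory is collected with the same action-selection strategy and stored in D_heldout. A minimal sketch, assuming a Gym-style environment with the classic reset/step API and list-like buffers:

    def collect_episode(env, agent):
        """Roll out one episode and return its transitions."""
        transitions, obs, done = [], env.reset(), False
        while not done:
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            transitions.append((obs, action, reward, next_obs, done))
            obs = next_obs
        return transitions

    def collect_with_heldout_split(env, agent, D, D_heldout, num_episodes):
        for episode in range(1, num_episodes + 1):
            D.extend(collect_episode(env, agent))            # training buffer
            if episode % 10 == 0:
                # Extra trajectory, same action-selection strategy, used only for
                # measuring validation TD error.
                D_heldout.extend(collect_episode(env, agent))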
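
For reference, the hyperparameters from the Experiment Setup row gathered into one config mapping. The key names are illustrative shorthand; only the values come from the row above.

    SAC_HYPERPARAMS = {
        "initial_temperature": 1.0,
        "target_update_rate": 0.005,        # Polyak rate for target networks
        "learning_rate": 3e-4,              # Adam optimizer
        "discount": 0.99,
        "batch_size": 256,
        "hidden_dims": (256, 256),
        "warmup_steps": {"dmc": 10_000, "gym_mujoco": 5_000},  # initial random exploration
    }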