Observational Overfitting in Reinforcement Learning

Authors: Xingyou Song, Yiding Jiang, Stephen Tu, Yilun Du, Behnam Neyshabur

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments expose intriguing properties especially with regards to implicit regularization, and also corroborate results from previous works in RL generalization and supervised learning (SL)."
Researcher Affiliation | Collaboration | Xingyou Song, Yiding Jiang, Stephen Tu, Behnam Neyshabur (Google) {xingyousong,ydjiang,stephentu,neyshabur}@google.com; Yilun Du (MIT) yilundu@mit.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-formatted procedures).
Open Source Code | No | The paper only references third-party repositories for the models and tools it uses; it does not provide access to the authors' own source code for the methodology presented.
Open Datasets | Yes | "We study observational overfitting with linear quadratic regulators (LQR) in a synthetic environment and neural networks such as multi-layer perceptrons (MLPs) and convolutions in classic Gym environments." (A hedged sketch of such a projected observation follows the table.)
Dataset Splits | No | The paper mentions 'training levels' and 'test time' for environments such as Gym and CoinRun (e.g., '10 training levels'), but does not give the percentages, sample counts, or split methodology needed to reproduce train/validation/test splits. (An illustrative level split is sketched after the table.)
Hardware Specification | No | The paper does not give specific hardware details such as GPU/CPU models, memory, or the type of computing resources used for the experiments; it only mentions 'GPU' in passing.
Software Dependencies | No | The paper names software components such as TensorFlow and PPO2 but provides no version numbers for them or for any other dependencies, which would be necessary for reproducibility.
Experiment Setup | Yes | Appendix A.3.4 (PPO parameters) lists the PPO2 hyperparameters used for the projected Gym tasks (a sketch wiring them into PPO2 follows):

PPO2 Hyperparameter | Value
nsteps | 2048
nenvs | 16
nminibatches | 64
λ | 0.95
γ | 0.99
noptepochs | 10
entropy coefficient | 0.0
learning rate | 3 × 10⁻⁴
vf coefficient | 0.5
max-grad-norm | 0.5
total time steps | Varying
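For concreteness, the kind of "projected" observation the paper studies can be sketched as a Gym wrapper that concatenates a level-invariant projection of the underlying state with a per-level random one. This is a minimal illustration, not the authors' released code; the class name, output dimension, and seeding scheme are assumptions.

```python
import gym
import numpy as np


class ProjectedObsWrapper(gym.ObservationWrapper):
    """Sketch: lift the true low-dimensional state into a larger observation
    made of a 'signal' block shared across levels and a 'noise' block drawn
    per level. Names and dimensions here are illustrative assumptions."""

    def __init__(self, env, out_dim=64, level_seed=0):
        super().__init__(env)
        state_dim = env.observation_space.shape[0]
        shared_rng = np.random.RandomState(12345)      # same for every level
        level_rng = np.random.RandomState(level_seed)  # differs per level
        self.W_signal = shared_rng.randn(out_dim, state_dim)
        self.W_noise = level_rng.randn(out_dim, state_dim)
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(2 * out_dim,), dtype=np.float32)

    def observation(self, obs):
        # Concatenate the level-invariant and level-dependent projections.
        return np.concatenate(
            [self.W_signal @ obs, self.W_noise @ obs]).astype(np.float32)
```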
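Building on that sketch, the "10 training levels" mentioned under Dataset Splits would amount to a split over level seeds rather than over a held-out dataset; the environment and the specific seed ranges below are purely illustrative.

```python
train_seeds = range(10)          # '10 training levels'
test_seeds = range(1000, 1010)   # illustrative held-out levels

train_envs = [ProjectedObsWrapper(gym.make("CartPole-v1"), level_seed=s)
              for s in train_seeds]
test_envs = [ProjectedObsWrapper(gym.make("CartPole-v1"), level_seed=s)
             for s in test_seeds]
```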
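The hyperparameter names in the A.3.4 table map directly onto the keyword arguments of OpenAI Baselines' ppo2.learn, so the setup can be sketched as below. The environment, the network choice, and the total_timesteps value (reported only as "Varying") are assumptions, not values taken from the paper.

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

# nenvs = 16 is realized by vectorizing 16 environment copies.
venv = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(16)])

model = ppo2.learn(
    network="mlp",             # assumption: stand-in for the paper's networks
    env=venv,
    nsteps=2048,
    nminibatches=64,
    lam=0.95,                  # λ
    gamma=0.99,                # γ
    noptepochs=10,
    ent_coef=0.0,              # entropy coefficient
    lr=3e-4,                   # learning rate 3 × 10⁻⁴
    vf_coef=0.5,
    max_grad_norm=0.5,
    total_timesteps=int(1e6),  # the paper leaves this as 'Varying'
)
```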