Instabilities of Offline RL with Pre-Trained Neural Representation

Authors: Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, Sham Kakade

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on a range of tasks, we see that substantial error amplification does occur even when using such pre-trained representations (trained on the same task itself); we find offline RL is stable only under extremely mild distribution shift. The goal of our experimental evaluation is to understand whether offline RL methods are sensitive to distribution shift in practical tasks, given a good representation (features extracted from pre-trained neural networks or random features).
Researcher Affiliation | Collaboration | Ruosong Wang (1), Yifan Wu (1), Ruslan Salakhutdinov (1), Sham M. Kakade (2, 3); (1) Carnegie Mellon University, Pittsburgh, PA, USA; (2) Microsoft Research, New York, NY, USA; (3) University of Washington, Seattle, WA, USA.
Pseudocode | Yes | Algorithm 1 Fitted Q-Iteration (FQI)
1: Input: policy π to be evaluated, number of samples N, regularization parameter λ > 0, number of rounds T
2: Take samples (s_i, a_i) ∼ μ, r_i ∼ r(s_i, a_i) and s̄_i ∼ P(s_i, a_i) for each i ∈ [N]
3: Λ̂ = (1/N) Σ_{i ∈ [N]} φ(s_i, a_i) φ(s_i, a_i)^⊤ + λI
4: Q̂_0(·, ·) = 0 and V̂_0(·) = 0
5: for t = 1, 2, ..., T do
6:   θ̂_t = Λ̂^{-1} ((1/N) Σ_{i=1}^{N} φ(s_i, a_i) (r_i + γ V̂_{t-1}(s̄_i)))
7:   Q̂_t(·, ·) = φ(·, ·)^⊤ θ̂_t and V̂_t(·) = Q̂_t(·, π(·))
8: end for
9: return Q̂_T(·, ·)
(A hedged Python sketch of this procedure follows the table.)
Open Source Code | No | The information is insufficient. The paper does not provide any statement or link indicating that the source code for their methodology is publicly available.
Open Datasets | Yes | Our experiments are performed on a range of challenging tasks from the OpenAI gym benchmark suite (Brockman et al., 2016), including two environments with discrete action space (MountainCar-v0, CartPole-v0) and four environments with continuous action space (Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2). (See the data-logging sketch after the table.)
Dataset Splits | No | The information is insufficient. The paper describes how the offline datasets are composed from different sources (target policy, random policies, lower-performance policies) and their sizes, but it does not specify explicit train/validation/test splits.
Hardware Specification | No | The information is insufficient. The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The information is insufficient. The paper mentions the use of algorithms like DQN and TD3, and the OpenAI Gym suite, but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | The hyperparameters used can be found in Section C. The target policy is set to be the final policy output by DQN or TD3. We also set the feature mapping to be the output of the last hidden layer of the learned value function networks, extracted in the final stage of the online RL methods. For both algorithms, the only hyperparameter is the regularization parameter λ (cf. Algorithm 1), which we choose from {10^-1, 10^-2, 10^-3, 10^-4, 10^-8}. (See the feature-extraction and λ-sweep sketch after the table.)
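
For concreteness, here is a minimal NumPy sketch of the FQI procedure quoted in the Pseudocode row. It is an illustrative reconstruction, not the authors' released code: the function name fqi, the array layout, and the choice to precompute φ(s̄_i, π(s̄_i)) for the fixed target policy are assumptions on top of Algorithm 1 as reconstructed above.

```python
import numpy as np

def fqi(phi_sa, rewards, phi_next_pi, lam=1e-3, gamma=0.99, num_rounds=100):
    """Fitted Q-Iteration with linear features (sketch of Algorithm 1 above).

    phi_sa      : (N, d) features phi(s_i, a_i) of the logged state-action pairs
    rewards     : (N,)   rewards r_i
    phi_next_pi : (N, d) features phi(s_bar_i, pi(s_bar_i)) of the sampled next
                  states under the fixed target policy pi (precomputed)
    lam         : ridge regularization parameter lambda
    """
    n, d = phi_sa.shape
    # Regularized covariance: Lambda_hat = (1/N) * sum_i phi phi^T + lambda * I
    cov = phi_sa.T @ phi_sa / n + lam * np.eye(d)
    cov_inv = np.linalg.inv(cov)

    theta = np.zeros(d)  # theta_0 = 0, so Q_0 = 0 and V_0 = 0
    for _ in range(num_rounds):
        v_next = phi_next_pi @ theta        # V_{t-1}(s_bar_i) = Q_{t-1}(s_bar_i, pi(s_bar_i))
        targets = rewards + gamma * v_next  # regression targets r_i + gamma * V_{t-1}(s_bar_i)
        theta = cov_inv @ (phi_sa.T @ targets / n)
    return theta  # Q_T(s, a) = phi(s, a) @ theta
```

Because the target policy π is fixed, V̂_{t-1}(s̄_i) = φ(s̄_i, π(s̄_i))^⊤ θ̂_{t-1}, so the next-state features can be computed once up front rather than inside the loop.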
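
The Open Datasets row names six Gym tasks but, as noted under Dataset Splits, the paper only describes how the offline datasets are composed. The sketch below is a hypothetical illustration of how transitions could be logged from those environments; it assumes the older gym API (4-tuple env.step, env.seed), and behavior_policy is a placeholder for whichever logging policy (target, random, or lower-performance) is used.

```python
import gym

ENV_IDS = ["MountainCar-v0", "CartPole-v0",
           "Ant-v2", "HalfCheetah-v2", "Hopper-v2", "Walker2d-v2"]

def collect_offline_data(env_id, behavior_policy, num_samples, seed=0):
    """Roll out a behavior policy and log (s, a, r, s') transitions."""
    env = gym.make(env_id)
    env.seed(seed)                                    # classic (pre-0.26) gym seeding
    data = []
    obs = env.reset()
    while len(data) < num_samples:
        action = behavior_policy(obs)
        next_obs, reward, done, _ = env.step(action)  # classic 4-tuple step API
        data.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs
    return data
```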
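
The Experiment Setup row says the feature map φ is the last hidden layer of the value network learned online (DQN or TD3) and that only λ is tuned. The sketch below reuses the fqi function from the first sketch; the module attribute trunk (the layers up to the last hidden layer) and the concatenated state-action input are assumptions about the critic architecture, which the quoted text does not specify.

```python
import torch

LAMBDAS = [1e-1, 1e-2, 1e-3, 1e-4, 1e-8]  # lambda grid as reconstructed from the quote above

def last_hidden_features(q_network, states, actions):
    """Features = activations of the last hidden layer of a trained Q-network.

    Assumes q_network exposes `trunk`, the layers up to (and including) the
    last hidden layer; the paper's actual architecture may differ.
    """
    with torch.no_grad():
        # TD3-style critic input; a DQN feature map would take states only
        x = torch.cat([states, actions], dim=-1)
        return q_network.trunk(x).cpu().numpy()

def sweep_lambda(phi_sa, rewards, phi_next_pi, gamma=0.99):
    """Run the FQI sketch once per candidate regularization value."""
    return {lam: fqi(phi_sa, rewards, phi_next_pi, lam=lam, gamma=gamma)
            for lam in LAMBDAS}
```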