Instabilities of Offline RL with Pre-Trained Neural Representation

Authors: Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, Sham Kakade

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on a range of tasks, we see that substantial error amplification does occur even when using such pre-trained representations (trained on the same task itself); we find offline RL is stable only under extremely mild distribution shift. The goal of our experimental evaluation is to understand whether offline RL methods are sensitive to distribution shift in practical tasks, given a good representation (features extracted from pre-trained neural networks or random features).
Researcher Affiliation | Collaboration | Ruosong Wang (1), Yifan Wu (1), Ruslan Salakhutdinov (1), Sham M. Kakade (2, 3); (1) Carnegie Mellon University, Pittsburgh, PA, USA; (2) Microsoft Research, New York, NY, USA; (3) University of Washington, Seattle, WA, USA.
Pseudocode | Yes | Algorithm 1 Fitted Q-Iteration (FQI)
1: Input: policy π to be evaluated, number of samples N, regularization parameter λ > 0, number of rounds T
2: Take samples (s_i, a_i) ∼ μ, r_i ∼ r(s_i, a_i) and s̄_i ∼ P(s_i, a_i) for each i ∈ [N]
3: Λ̂ = (1/N) Σ_{i ∈ [N]} φ(s_i, a_i) φ(s_i, a_i)^⊤ + λI
4: Q̂_0(·, ·) = 0 and V̂_0(·) = 0
5: for t = 1, 2, ..., T do
6:   θ̂_t = Λ̂^{-1} ((1/N) Σ_{i=1}^{N} φ(s_i, a_i) (r_i + γ V̂_{t-1}(s̄_i)))
7:   Q̂_t(·, ·) = φ(·, ·)^⊤ θ̂_t and V̂_t(·) = Q̂_t(·, π(·))
8: end for
9: return Q̂_T(·, ·)
(A hedged Python sketch of this procedure follows the table.)
Open Source Code | No | The information is insufficient. The paper does not provide any statement or link indicating that the source code for their methodology is publicly available.
Open Datasets | Yes | Our experiments are performed on a range of challenging tasks from the OpenAI gym benchmark suite (Brockman et al., 2016), including two environments with discrete action space (MountainCar-v0, CartPole-v0) and four environments with continuous action space (Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2). (See the data-logging sketch after the table.)
Dataset Splits | No | The information is insufficient. The paper describes how the offline datasets are composed from different sources (target policy, random policies, lower-performance policies) and their sizes, but it does not specify explicit train/validation/test splits.
Hardware Specification | No | The information is insufficient. The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The information is insufficient. The paper mentions the use of algorithms like DQN and TD3, and the OpenAI Gym suite, but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | The hyperparameters used can be found in Section C. The target policy is set to be the final policy output by DQN or TD3. We also set the feature mapping to be the output of the last hidden layer of the learned value function networks, extracted in the final stage of the online RL methods. For both algorithms, the only hyperparameter is the regularization parameter λ (cf. Algorithm 1), which we choose from {10^-1, 10^-2, 10^-3, 10^-4, 10^-8}. (See the feature-extraction and λ-sweep sketch after the table.)
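
For concreteness, here is a minimal NumPy sketch of the FQI procedure quoted in the Pseudocode row. It is an illustrative reconstruction, not the authors' released code: the function name fqi, the array layout, and the choice to precompute φ(s̄_i, π(s̄_i)) for the fixed target policy are assumptions on top of Algorithm 1 as reconstructed above.

```python
import numpy as np

def fqi(phi_sa, rewards, phi_next_pi, lam=1e-3, gamma=0.99, num_rounds=100):
    """Fitted Q-Iteration with linear features (sketch of Algorithm 1 above).

    phi_sa      : (N, d) features phi(s_i, a_i) of the logged state-action pairs
    rewards     : (N,)   rewards r_i
    phi_next_pi : (N, d) features phi(s_bar_i, pi(s_bar_i)) of the sampled next
                  states under the fixed target policy pi (precomputed)
    lam         : ridge regularization parameter lambda
    """
    n, d = phi_sa.shape
    # Regularized covariance: Lambda_hat = (1/N) * sum_i phi phi^T + lambda * I
    cov = phi_sa.T @ phi_sa / n + lam * np.eye(d)
    cov_inv = np.linalg.inv(cov)

    theta = np.zeros(d)  # theta_0 = 0, so Q_0 = 0 and V_0 = 0
    for _ in range(num_rounds):
        v_next = phi_next_pi @ theta        # V_{t-1}(s_bar_i) = Q_{t-1}(s_bar_i, pi(s_bar_i))
        targets = rewards + gamma * v_next  # regression targets r_i + gamma * V_{t-1}(s_bar_i)
        theta = cov_inv @ (phi_sa.T @ targets / n)
    return theta  # Q_T(s, a) = phi(s, a) @ theta
```

Because the target policy π is fixed, V̂_{t-1}(s̄_i) = φ(s̄_i, π(s̄_i))^⊤ θ̂_{t-1}, so the next-state features can be computed once up front rather than inside the loop.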
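
The Open Datasets row names six Gym tasks but, as noted under Dataset Splits, the paper only describes how the offline datasets are composed. The sketch below is a hypothetical illustration of how transitions could be logged from those environments; it assumes the older gym API (4-tuple env.step, env.seed), and behavior_policy is a placeholder for whichever logging policy (target, random, or lower-performance) is used.

```python
import gym

ENV_IDS = ["MountainCar-v0", "CartPole-v0",
           "Ant-v2", "HalfCheetah-v2", "Hopper-v2", "Walker2d-v2"]

def collect_offline_data(env_id, behavior_policy, num_samples, seed=0):
    """Roll out a behavior policy and log (s, a, r, s') transitions."""
    env = gym.make(env_id)
    env.seed(seed)                                    # classic (pre-0.26) gym seeding
    data = []
    obs = env.reset()
    while len(data) < num_samples:
        action = behavior_policy(obs)
        next_obs, reward, done, _ = env.step(action)  # classic 4-tuple step API
        data.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs
    return data
```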
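
The Experiment Setup row says the feature map φ is the last hidden layer of the value network learned online (DQN or TD3) and that only λ is tuned. The sketch below reuses the fqi function from the first sketch; the module attribute trunk (the layers up to the last hidden layer) and the concatenated state-action input are assumptions about the critic architecture, which the quoted text does not specify.

```python
import torch

LAMBDAS = [1e-1, 1e-2, 1e-3, 1e-4, 1e-8]  # lambda grid as reconstructed from the quote above

def last_hidden_features(q_network, states, actions):
    """Features = activations of the last hidden layer of a trained Q-network.

    Assumes q_network exposes `trunk`, the layers up to (and including) the
    last hidden layer; the paper's actual architecture may differ.
    """
    with torch.no_grad():
        # TD3-style critic input; a DQN feature map would take states only
        x = torch.cat([states, actions], dim=-1)
        return q_network.trunk(x).cpu().numpy()

def sweep_lambda(phi_sa, rewards, phi_next_pi, gamma=0.99):
    """Run the FQI sketch once per candidate regularization value."""
    return {lam: fqi(phi_sa, rewards, phi_next_pi, lam=lam, gamma=gamma)
            for lam in LAMBDAS}
```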