Self-Supervised Reinforcement Learning that Transfers using Random Features

Authors: Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, Abhishek Gupta

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate that our proposed method enables transfer across tasks on a variety of manipulation and locomotion domains in simulation, opening the door to generalist decision-making agents.
Researcher Affiliation | Academia | Boyuan Chen, Massachusetts Institute of Technology, Boston, MA 02139, boyuanc@mit.edu; Chuning Zhu, University of Washington, Seattle, WA 98105, zchuning@cs.washington.edu; Pulkit Agrawal, Massachusetts Institute of Technology, Boston, MA 02139, pulkitag@mit.edu; Kaiqing Zhang, University of Maryland, College Park, MD 20742, kaiqing@umd.edu; Abhishek Gupta, University of Washington, Seattle, WA 98105, abhgupta@cs.washington.edu
Pseudocode | Yes | Appendix A (Algorithm Pseudocode): Algorithm 1, "Model-Free Transfer with Randomized Cumulants and Model-Predictive Control". (A hedged sketch of the algorithm's structure appears after the table.)
Open Source Code | No | The paper does not contain any explicit statement about releasing the code for the described methodology or a link to a code repository.
Open Datasets | Yes | All our Meta-World [56] domains share the standard Meta-World observation, which includes the gripper location and the object locations of all possible objects involved in the Meta-World benchmark. [...] Hopper is an environment with a higher-dimensional observation of 11 dimensions and an action-space dimension of 3. Hopper is a locomotion environment that requires long-horizon reasoning, since a wrong action will make it fall down and touch the ground only after some steps. The objective of the original Hopper environment in OpenAI Gym [7] is to train it to run forward. (See the dimension check after the table.)
Dataset Splits | No | The paper describes collecting offline datasets and selecting online objectives for test-time adaptation, but it does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts) for model training or evaluation. It only mentions that "Each method is benchmarked on each domain with 4 seeds."
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing instance specifications.
Software Dependencies | No | The paper mentions using "SAC [15]" for training policies and refers to a "CNN encoder following the architecture in [32]", but it does not provide specific version numbers for any software, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow, etc.).
Experiment Setup | Yes | We choose the random feature dimension to be 2048. Each dimension in the random feature is extracted by feeding the state-action tuple to a randomly initialized MLP with 2 hidden layers of size 32. During the offline training phase, we ensemble 8 MLP instances with 2 hidden layers of size 4096 and train the ψ network following Sec. 3.2. We train the ψ network with a learning rate of 3 × 10⁻⁴ on the offline dataset for 4 epochs, with a γ decay of 0.9 and batch size 128. We choose the horizon H to be 16 for the Meta-World and Hopper environments and 10 for D'Claw. During the online adaptation phase, we first do random exploration for 2500 steps to collect enough data points for linear regression. In each MPC step, we randomly sample 1024 action sequences. We penalize the predicted rewards with the variance of predictions from all 8 ensembles following Sec. B.3. We use an MPPI planner with γ = 10 for the D'Claw environment. (A PyTorch sketch of this configuration appears after the table.)
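
For context on the Pseudocode row: Algorithm 1 combines successor features of random cumulants with sampling-based planning. The equations below are a hedged reconstruction from the standard successor-feature identity and the steps quoted in the Experiment Setup row; the symbols φ, ψ, w, and the penalty weight λ follow common convention and are not copied from the paper.

```latex
% Hedged reconstruction of the structure behind Algorithm 1 (not the authors' notation).
\begin{align*}
  &\text{Offline (random cumulants $\phi$, successor features $\psi$):}\\
  &\quad \psi(s_t, a_t) \;\approx\; \mathbb{E}\!\left[\sum_{k \ge 0} \gamma^{k}\, \phi(s_{t+k}, a_{t+k})\right]
    \quad \text{(TD target: } \phi(s,a) + \gamma\, \psi(s',a')\text{)} \\[4pt]
  &\text{Online (reward identification by least squares on explored data):}\\
  &\quad w^{\star} \;=\; \arg\min_{w} \sum_{i} \bigl(r_i - \phi(s_i, a_i)^{\top} w\bigr)^{2} \\[4pt]
  &\text{Planning (sampling-based MPC, ensemble-variance penalized):}\\
  &\quad \widehat{Q}(s, a) \;=\; \psi(s, a)^{\top} w^{\star}
    \;-\; \lambda\, \mathrm{Var}_{\mathrm{ensemble}}\!\bigl[\psi(s, a)^{\top} w^{\star}\bigr]
\end{align*}
```

To make the Hopper figures in the Open Datasets row concrete, the snippet below checks the quoted dimensions against the OpenAI Gym environment. The environment id "Hopper-v3" is an assumption, since the paper does not state which Gym version was used.

```python
# Sanity check of the Hopper dimensions quoted above (11-dim observation, 3-dim action).
# "Hopper-v3" is an assumed environment id; the paper does not specify a Gym version.
import gym

env = gym.make("Hopper-v3")
print(env.observation_space.shape)  # (11,)
print(env.action_space.shape)       # (3,)
```

Finally, to ground the Experiment Setup row, here is a minimal PyTorch sketch of the reported architecture and hyperparameters. All class and variable names are ours, not the authors'; for brevity the per-dimension random MLPs are collapsed into a single frozen MLP with a 2048-dimensional head, a simplification of the construction quoted above.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 2048   # random feature dimension (quoted)
HIDDEN_RAND = 32     # hidden size of the random MLPs (quoted)
HIDDEN_PSI = 4096    # hidden size of the psi-network MLPs (quoted)
N_ENSEMBLE = 8       # ensemble size (quoted)


class RandomFeatures(nn.Module):
    """Frozen, randomly initialized MLP mapping (s, a) to 2048 random features.

    The paper describes one small random MLP per feature dimension; this sketch
    uses a single frozen MLP with a 2048-dim output head instead.
    """

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, HIDDEN_RAND), nn.ReLU(),
            nn.Linear(HIDDEN_RAND, HIDDEN_RAND), nn.ReLU(),
            nn.Linear(HIDDEN_RAND, FEATURE_DIM),
        )
        for p in self.net.parameters():  # random cumulants are never trained
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


def make_psi_ensemble(obs_dim: int, act_dim: int) -> nn.ModuleList:
    """Ensemble of 8 MLPs (2 hidden layers of 4096) predicting successor features psi(s, a)."""
    def one() -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(obs_dim + act_dim, HIDDEN_PSI), nn.ReLU(),
            nn.Linear(HIDDEN_PSI, HIDDEN_PSI), nn.ReLU(),
            nn.Linear(HIDDEN_PSI, FEATURE_DIM),
        )
    return nn.ModuleList([one() for _ in range(N_ENSEMBLE)])


# Hyperparameters quoted in the Experiment Setup row (constant names are ours):
LR = 3e-4                                          # psi-network learning rate
GAMMA = 0.9                                        # gamma decay
BATCH_SIZE = 128
EPOCHS = 4                                         # offline training epochs
HORIZON = {"metaworld": 16, "hopper": 16, "dclaw": 10}
N_MPC_SAMPLES = 1024                               # action sequences per MPC step
N_EXPLORE_STEPS = 2500                             # random exploration before regression
```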