Self-Supervised Reinforcement Learning that Transfers using Random Features
Authors: Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, Abhishek Gupta
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate that our proposed method enables transfer across tasks on a variety of manipulation and locomotion domains in simulation, opening the door to generalist decision-making agents. |
| Researcher Affiliation | Academia | Boyuan Chen, Massachusetts Institute of Technology, Boston, MA 02139, boyuanc@mit.edu; Chuning Zhu, University of Washington, Seattle, WA 98105, zchuning@cs.washington.edu; Pulkit Agrawal, Massachusetts Institute of Technology, Boston, MA 02139, pulkitag@mit.edu; Kaiqing Zhang, University of Maryland, College Park, MD 20742, kaiqing@umd.edu; Abhishek Gupta, University of Washington, Seattle, WA 98105, abhgupta@cs.washington.edu |
| Pseudocode | Yes | Appendix A (Algorithm Pseudocode): Algorithm 1, Model-Free Transfer with Randomized Cumulants and Model-Predictive Control |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | All our Meta-World [56] domains share the standard Meta-World observation, which includes gripper location and object locations of all possible objects involved in the Meta-World benchmark. [...] Hopper is an environment with a higher-dimensional observation of 11 dimensions and an action space dimension of 3. Hopper is a locomotion environment that requires long-horizon reasoning, since a wrong action will make it fall and touch the ground only after some steps. The objective of the original Hopper environment in OpenAI Gym [7] is to train it to run forward. |
| Dataset Splits | No | The paper describes collecting offline datasets and selecting online objectives for test-time adaptation, but it does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts) for model training or evaluation. It only mentions that "Each method is benchmarked on each domain with 4 seeds." |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing instance specifications. |
| Software Dependencies | No | The paper mentions using "SAC [15]" for training policies and refers to a "CNN encoder following the architecture in [32]", but it does not provide specific version numbers for any software, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | We choose the random feature dimension to be 2048. Each dimension in the random feature is extracted by feeding the state-action tuple to a randomly initialized MLP with 2 hidden layers of size 32. During the offline training phase, we ensemble 8 instances of an MLP with 2 hidden layers of size 4096 and train the ψ network following Sec. 3.2. We train the ψ network with a learning rate of 3 × 10⁻⁴ on the offline dataset for 4 epochs, with a γ decay of 0.9 and batch size 128. We choose the horizon H to be 16 for the Meta-World and Hopper environments and 10 for D'Claw. During the online adaptation phase, we first do random exploration for 2500 steps to collect enough data points for linear regression. In each MPC step, we randomly sample 1024 action sequences. We penalize the predicted rewards with the variance of the predictions across all 8 ensemble members, following Sec. B.3. We use an MPPI planner with γ = 10 for the D'Claw environment. |
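
The offline recipe quoted in the Experiment Setup row (2048-dimensional random features from frozen, randomly initialized MLPs with hidden size 32, plus an ensemble of 8 ψ-networks with hidden size 4096 trained with γ = 0.9) can be illustrated with the minimal PyTorch sketch below. The class names `RandomFeatureNet` and `PsiEnsemble`, the single wide random MLP standing in for per-dimension random heads, and the successor-feature-style TD target in `td_update` are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (assumed structure, not the authors' code) of the offline phase:
# frozen random-feature "cumulants" phi(s, a) and an ensemble of psi-networks
# trained to predict their discounted sums.
import torch
import torch.nn as nn

FEATURE_DIM = 2048   # quoted random-feature dimension
HIDDEN_RANDOM = 32   # quoted hidden size of the random MLPs
HIDDEN_PSI = 4096    # quoted hidden size of the psi-networks
NUM_ENSEMBLE = 8     # quoted ensemble size
GAMMA = 0.9          # quoted discount ("gamma decay of 0.9")


class RandomFeatureNet(nn.Module):
    """Frozen, randomly initialized MLP producing all feature dimensions at once.

    The quoted text describes one randomly initialized MLP per dimension; a
    single wide output head is used here only for compactness.
    """

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, HIDDEN_RANDOM), nn.ReLU(),
            nn.Linear(HIDDEN_RANDOM, HIDDEN_RANDOM), nn.ReLU(),
            nn.Linear(HIDDEN_RANDOM, FEATURE_DIM),
        )
        for p in self.parameters():   # random features are never trained
            p.requires_grad_(False)

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


class PsiEnsemble(nn.Module):
    """Ensemble of psi-networks predicting discounted sums of random features."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, HIDDEN_PSI), nn.ReLU(),
                nn.Linear(HIDDEN_PSI, HIDDEN_PSI), nn.ReLU(),
                nn.Linear(HIDDEN_PSI, FEATURE_DIM),
            )
            for _ in range(NUM_ENSEMBLE)
        )

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return torch.stack([m(x) for m in self.members])  # (ensemble, batch, feat)


def td_update(phi_net, psi, psi_target, batch, optimizer):
    """One TD-style update toward phi(s, a) + GAMMA * psi_target(s', a').

    This is the standard successor-feature target; the exact loss in the
    paper's Sec. 3.2 may differ.
    """
    obs, act, next_obs, next_act = batch
    with torch.no_grad():
        target = phi_net(obs, act).unsqueeze(0) + GAMMA * psi_target(next_obs, next_act)
    loss = ((psi(obs, act) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the random features are frozen, the ψ ensemble is task-agnostic; per the quoted setup, adapting to a new task only requires fitting linear reward weights online, as sketched next.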
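The online adaptation quoted in the same row (2500 random exploration steps, a linear regression of observed rewards onto the random features, and sampling 1024 candidate action sequences per MPC step with an ensemble-variance penalty) could look roughly like the self-contained NumPy sketch below. The helper names `fit_reward_weights` and `score_action_sequences`, the synthetic stand-in data, and the penalty weight `VARIANCE_COEF` are assumptions; the quoted text does not specify the penalty coefficient.

```python
# Minimal sketch (assumed structure, not the authors' code) of the online phase:
# fit reward weights w so that r(s, a) ~ w . phi(s, a), then score MPC candidates
# by the ensemble-mean predicted return minus an ensemble-variance penalty.
import numpy as np

NUM_SEQUENCES = 1024   # quoted number of sampled action sequences per MPC step
VARIANCE_COEF = 1.0    # assumed penalty weight; not given in the quoted text


def fit_reward_weights(features: np.ndarray, rewards: np.ndarray) -> np.ndarray:
    """Least-squares fit of observed rewards onto random features.

    features: (num_steps, feature_dim) phi(s, a) collected during exploration.
    rewards:  (num_steps,) observed task rewards.
    """
    w, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return w


def score_action_sequences(psi_preds: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Score candidate action sequences from ensemble psi predictions.

    psi_preds: (ensemble, num_sequences, feature_dim) predicted discounted
               feature sums for each candidate sequence.
    Returns (num_sequences,) penalized return estimates.
    """
    returns = psi_preds @ w                 # (ensemble, num_sequences)
    mean_return = returns.mean(axis=0)
    disagreement = returns.var(axis=0)      # penalize ensemble disagreement
    return mean_return - VARIANCE_COEF * disagreement


# Toy usage with synthetic stand-ins for exploration data and psi predictions.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2500, 2048))       # 2500 exploration steps, 2048-dim phi
rews = feats @ rng.normal(size=2048) * 0.01 + 0.1 * rng.normal(size=2500)
w = fit_reward_weights(feats, rews)

psi_preds = rng.normal(size=(8, NUM_SEQUENCES, 2048))
best = int(np.argmax(score_action_sequences(psi_preds, w)))
print("best candidate action sequence:", best)
```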