On Representation Complexity of Model-based and Model-free Reinforcement Learning
Authors: Hanlin Zhu, Baihe Huang, Stuart Russell
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. |
| Researcher Affiliation | Academia | Hanlin Zhu , Baihe Huang , Stuart Russell EECS, UC Berkeley {hanlinzhu,baihe_huang,russell}@berkeley.edu |
| Pseudocode | No | The paper includes diagrams and theoretical descriptions of circuits, but it does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/realgourmet/rep_complexity_rl. |
| Open Datasets | Yes | For common MuJoCo Gym environments (Brockman et al., 2016), including Ant-v4, Hopper-v4, HalfCheetah-v4, InvertedPendulum-v4, and Walker2d-v4 |
| Dataset Splits | No | The paper does not explicitly provide details about specific training, validation, and test splits for the data used in their neural network fitting experiments. It describes hyperparameters for training but not data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam (Kingma & Ba, 2014)' and 'Soft-Actor-Critic (Haarnoja et al., 2018)' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Table 1: Hyperparameter Value(s): Optimizer Adam (Kingma & Ba, 2014), Learning Rate 0.0003, Batch Size 1000, Number of Epochs 100000, Init_temperature 0.1, Episode length 1000, Discount factor 0.99, Number of hidden layers (all networks) 2, Number of hidden units per layer 256, Target update interval 1. Table 2: Hyperparameter Value(s): Optimizer Adam (Kingma & Ba, 2014), Learning Rate 0.001, Batch Size 32, Number of Epochs 100. |
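The Table 2 row above describes a standard supervised-fitting configuration (Adam, learning rate 0.001, batch size 32, 100 epochs). A minimal NumPy-only sketch of that kind of setup is given below; it is not the authors' code, and the target function, network width, and synthetic data are illustrative stand-ins for the paper's transition-kernel/reward/Q-function regression targets.

```python
import numpy as np

# Hedged sketch (not the paper's implementation): fit a small MLP with a
# hand-written Adam optimizer using Table 2's hyperparameters
# (lr 0.001, batch size 32, 100 epochs). Data and target are synthetic.
rng = np.random.default_rng(0)

X = rng.uniform(size=(256, 4))
y = X.sum(axis=1, keepdims=True)          # placeholder regression target

# One-hidden-layer ReLU MLP: 4 -> 64 -> 1 (width is illustrative).
W1 = rng.normal(scale=0.5, size=(4, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.5, size=(64, 1)); b2 = np.zeros(1)
params = [W1, b1, W2, b2]

# Adam state (Kingma & Ba, 2014), as cited in Table 2.
m = [np.zeros_like(p) for p in params]
v = [np.zeros_like(p) for p in params]
lr, beta1, beta2, eps, t = 0.001, 0.9, 0.999, 1e-8, 0

def forward(xb):
    h = np.maximum(xb @ W1 + b1, 0.0)
    return h, h @ W2 + b2

for epoch in range(100):                   # Table 2: 100 epochs
    perm = rng.permutation(len(X))
    for i in range(0, len(X), 32):         # Table 2: batch size 32
        idx = perm[i:i + 32]
        xb, yb = X[idx], y[idx]
        h, pred = forward(xb)
        # MSE loss, backpropagated by hand.
        g_out = 2.0 * (pred - yb) / len(idx)
        gW2 = h.T @ g_out; gb2 = g_out.sum(0)
        g_h = (g_out @ W2.T) * (h > 0)
        gW1 = xb.T @ g_h; gb1 = g_h.sum(0)
        t += 1
        for p, g, mi, vi in zip(params, [gW1, gb1, gW2, gb2], m, v):
            mi[:] = beta1 * mi + (1 - beta1) * g
            vi[:] = beta2 * vi + (1 - beta2) * g * g
            m_hat = mi / (1 - beta1 ** t)
            v_hat = vi / (1 - beta2 ** t)
            p -= lr * m_hat / (np.sqrt(v_hat) + eps)

_, pred = forward(X)
approx_error = float(np.mean((pred - y) ** 2))
print(f"final MSE (approximation-error proxy): {approx_error:.4f}")
```

In the paper's experiments the analogous fitted quantities are the transition kernel, the reward function, and the optimal Q-function, and the final fitting losses serve as the approximation errors being compared.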