Addressing Function Approximation Error in Actor-Critic Methods
Authors: Scott Fujimoto, Herke van Hoof, David Meger
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested. Given the recent concerns in reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform ablation studies across each contribution, and open source both our code and learning curves (https://github.com/sfujim/TD3). |
| Researcher Affiliation | Academia | McGill University, Montreal, Canada; University of Amsterdam, Amsterdam, Netherlands. |
| Pseudocode | Yes | Algorithm 1 TD3 |
| Open Source Code | Yes | Given the recent concerns in reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform ablation studies across each contribution, and open source both our code and learning curves (https://github.com/sfujim/TD3). |
| Open Datasets | Yes | We evaluate our method on the suite of OpenAI gym tasks... interfaced through OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper uses continuous control environments where data is generated through interaction, not from a fixed dataset with predefined train/validation/test splits. Therefore, it does not provide explicit dataset split information for reproduction in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in their implementation. |
| Experiment Setup | Yes | For our implementation of DDPG (Lillicrap et al., 2015), we use a two layer feedforward neural network of 400 and 300 hidden nodes respectively, with rectified linear units (ReLU) between each layer for both the actor and critic, and a final tanh unit following the output of the actor. Both network parameters are updated using Adam (Kingma & Ba, 2014) with a learning rate of 10⁻³. After each time step, the networks are trained with a mini-batch of 100 transitions... The target policy smoothing is implemented by adding ϵ ∼ N(0, 0.2) to the actions chosen by the target actor network, clipped to (−0.5, 0.5); delayed policy updates consist of only updating the actor and target critic network every d iterations, with d = 2... Both target networks are updated with τ = 0.005. Afterwards, we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action. (See the code sketch below the table.) |
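
To make the quoted Experiment Setup row concrete, below is a minimal sketch of a TD3-style update step using the stated hyperparameters (400/300 ReLU networks with a final tanh on the actor, Adam with learning rate 10⁻³, mini-batches of 100, target noise N(0, 0.2) clipped to (−0.5, 0.5), policy delay d = 2, τ = 0.005). It assumes PyTorch; the class and function names, the optimizer dictionary, and the batch format are illustrative choices, not the authors' reference implementation (which is linked above).

```python
# A hedged sketch of the TD3 update described in the Experiment Setup row.
# Network sizes (400/300 with ReLU), the final tanh on the actor, Adam lr 1e-3,
# mini-batches of 100, target noise N(0, 0.2) clipped to (-0.5, 0.5), policy
# delay d = 2, and tau = 0.005 come from the quoted setup; all names and the
# batch/optimizer layout here are illustrative assumptions.
import copy  # used when wiring networks to their target copies (see usage comment)

import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # final tanh on the actor output
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)


class Critic(nn.Module):
    """One Q-network; TD3 trains two of them for clipped double Q-learning."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))


def td3_update(actor, actor_target, critics, critic_targets, optimizers, batch,
               it, max_action, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    state, action, next_state, reward, not_done = batch  # tensors of shape [100, ...]

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)
        # Clipped double Q-learning: take the minimum of the two target critics.
        target_q = torch.min(critic_targets[0](next_state, next_action),
                             critic_targets[1](next_state, next_action))
        target_q = reward + not_done * gamma * target_q

    # Regress both critics toward the shared target value.
    for critic, opt in zip(critics, optimizers["critics"]):
        critic_loss = F.mse_loss(critic(state, action), target_q)
        opt.zero_grad()
        critic_loss.backward()
        opt.step()

    # Delayed policy updates: actor and target networks move every `policy_delay` steps.
    if it % policy_delay == 0:
        actor_loss = -critics[0](state, actor(state)).mean()
        optimizers["actor"].zero_grad()
        actor_loss.backward()
        optimizers["actor"].step()

        # Soft (Polyak) target updates with tau = 0.005.
        pairs = [(actor, actor_target), (critics[0], critic_targets[0]),
                 (critics[1], critic_targets[1])]
        for net, target in pairs:
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1 - tau).add_(tau * p.data)


# Example wiring (Adam with lr 1e-3 per the quoted setup; dimensions are placeholders):
# actor = Actor(state_dim, action_dim, max_action)
# actor_target = copy.deepcopy(actor)
# critics = [Critic(state_dim, action_dim) for _ in range(2)]
# critic_targets = [copy.deepcopy(c) for c in critics]
# optimizers = {"actor": torch.optim.Adam(actor.parameters(), lr=1e-3),
#               "critics": [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]}
```

The paper's three modifications over DDPG map directly onto this sketch: clipped double Q-learning is the `torch.min` over the two target critics, target policy smoothing is the clipped noise added to the target action, and delayed policy updates are the `it % policy_delay` guard around the actor and target-network updates.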