Addressing Function Approximation Error in Actor-Critic Methods

Authors: Scott Fujimoto, Herke van Hoof, David Meger

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested. Given the recent concerns in reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform ablation studies across each contribution, and open source both our code and learning curves (https://github.com/sfujim/TD3).
Researcher Affiliation | Academia | 1McGill University, Montreal, Canada; 2University of Amsterdam, Amsterdam, Netherlands.
Pseudocode | Yes | Algorithm 1: TD3 (a sketch of the corresponding update step follows the table).
Open Source Code | Yes | Given the recent concerns in reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform ablation studies across each contribution, and open source both our code and learning curves (https://github.com/sfujim/TD3).
Open Datasets | Yes | We evaluate our method on the suite of OpenAI gym tasks... interfaced through OpenAI Gym (Brockman et al., 2016).
Dataset Splits | No | The paper uses continuous control environments where data is generated through interaction rather than drawn from a fixed dataset, so no explicit train/validation/test splits are provided.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in their implementation.
Experiment Setup | Yes | For our implementation of DDPG (Lillicrap et al., 2015), we use a two layer feedforward neural network of 400 and 300 hidden nodes respectively, with rectified linear units (ReLU) between each layer for both the actor and critic, and a final tanh unit following the output of the actor. Both network parameters are updated using Adam (Kingma & Ba, 2014) with a learning rate of 10⁻³. After each time step, the networks are trained with a mini-batch of 100 transitions... The target policy smoothing is implemented by adding ϵ ∼ N(0, 0.2) to the actions chosen by the target actor network, clipped to (−0.5, 0.5), delayed policy updates consists of only updating the actor and target critic network every d iterations, with d = 2... Both target networks are updated with τ = 0.005. Afterwards, we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action.
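
The architecture and optimizer details quoted in the Experiment Setup row translate directly into code. Below is a minimal PyTorch sketch of those choices (400/300 hidden units, ReLU activations, a final tanh on the actor, Adam with learning rate 10⁻³, and soft target updates with τ = 0.005); the class names, placeholder dimensions, and the soft_update helper are illustrative assumptions, not the authors' released implementation.

```python
import copy

import torch
import torch.nn as nn


class Actor(nn.Module):
    """Two hidden layers of 400 and 300 units, ReLU activations, tanh output."""

    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # Scale the tanh output to the environment's action range.
        return self.max_action * self.net(state)


class Critic(nn.Module):
    """State-action value network with the same 400/300 hidden layout."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))


def soft_update(target, source, tau=0.005):
    """Polyak-average the target network parameters with tau = 0.005."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


# Placeholder dimensions; the actual values depend on the gym environment.
actor = Actor(state_dim=17, action_dim=6, max_action=1.0)
actor_target = copy.deepcopy(actor)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)  # Adam, lr = 10^-3
```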
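
The Pseudocode row points to Algorithm 1 (TD3) in the paper. The sketch below paraphrases one training iteration using the hyperparameters from the Experiment Setup row: target policy smoothing with noise ∼ N(0, 0.2) clipped to (−0.5, 0.5), clipped double-Q targets from twin critics, a mini-batch of 100 transitions, and delayed actor/target updates every d = 2 iterations. It reuses the Actor/Critic classes and the soft_update helper from the previous sketch; the replay buffer interface and the discount gamma are assumptions, and this is not the authors' released code.

```python
import copy

import torch
import torch.nn.functional as F

# Twin critics and their targets (clipped double Q-learning needs a pair).
critic1, critic2 = Critic(17, 6), Critic(17, 6)
critic1_target, critic2_target = copy.deepcopy(critic1), copy.deepcopy(critic2)
critic_opt = torch.optim.Adam(
    list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)


def td3_update(replay_buffer, iteration, gamma=0.99, d=2,
               policy_noise=0.2, noise_clip=0.5, tau=0.005, batch_size=100):
    # Hypothetical buffer returning tensors; mini-batch of 100 transitions.
    state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

    with torch.no_grad():
        # Target policy smoothing: epsilon ~ N(0, 0.2), clipped to (-0.5, 0.5).
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(
            -actor.max_action, actor.max_action)

        # Clipped double Q-learning: take the minimum of the two target critics.
        target_q = torch.min(critic1_target(next_state, next_action),
                             critic2_target(next_state, next_action))
        target_q = reward + not_done * gamma * target_q

    # Regress both critics toward the shared clipped target.
    critic_loss = (F.mse_loss(critic1(state, action), target_q) +
                   F.mse_loss(critic2(state, action), target_q))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed updates: refresh the actor and all target networks every d = 2 steps.
    if iteration % d == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        soft_update(actor_target, actor, tau)
        soft_update(critic1_target, critic1, tau)
        soft_update(critic2_target, critic2, tau)
```

Per the quoted setup, exploration during environment interaction adds Gaussian noise N(0, 0.1) to each action selected by the current policy; that interaction loop is omitted from the sketch above.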