Addressing Function Approximation Error in Actor-Critic Methods

Authors: Scott Fujimoto, Herke van Hoof, David Meger

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested. Given the recent concerns in reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform ablation studies across each contribution, and open source both our code and learning curves (https://github.com/sfujim/TD3).
Researcher Affiliation | Academia | 1McGill University, Montreal, Canada; 2University of Amsterdam, Amsterdam, Netherlands.
Pseudocode | Yes | Algorithm 1: TD3 (a sketch of the corresponding update step follows the table).
Open Source Code | Yes | Given the recent concerns in reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform ablation studies across each contribution, and open source both our code and learning curves (https://github.com/sfujim/TD3).
Open Datasets | Yes | We evaluate our method on the suite of OpenAI gym tasks... interfaced through OpenAI Gym (Brockman et al., 2016).
Dataset Splits | No | The paper uses continuous control environments where data is generated through interaction rather than drawn from a fixed dataset, so no explicit train/validation/test splits are provided.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in their implementation.
Experiment Setup | Yes | For our implementation of DDPG (Lillicrap et al., 2015), we use a two layer feedforward neural network of 400 and 300 hidden nodes respectively, with rectified linear units (ReLU) between each layer for both the actor and critic, and a final tanh unit following the output of the actor. Both network parameters are updated using Adam (Kingma & Ba, 2014) with a learning rate of 10⁻³. After each time step, the networks are trained with a mini-batch of 100 transitions... The target policy smoothing is implemented by adding ϵ ∼ N(0, 0.2) to the actions chosen by the target actor network, clipped to (−0.5, 0.5), delayed policy updates consists of only updating the actor and target critic network every d iterations, with d = 2... Both target networks are updated with τ = 0.005. Afterwards, we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action.
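
The architecture and optimizer details quoted in the Experiment Setup row translate directly into code. Below is a minimal PyTorch sketch of those choices (400/300 hidden units, ReLU activations, a final tanh on the actor, Adam with learning rate 10⁻³, and soft target updates with τ = 0.005); the class names, placeholder dimensions, and the soft_update helper are illustrative assumptions, not the authors' released implementation.

```python
import copy

import torch
import torch.nn as nn


class Actor(nn.Module):
    """Two hidden layers of 400 and 300 units, ReLU activations, tanh output."""

    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # Scale the tanh output to the environment's action range.
        return self.max_action * self.net(state)


class Critic(nn.Module):
    """State-action value network with the same 400/300 hidden layout."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))


def soft_update(target, source, tau=0.005):
    """Polyak-average the target network parameters with tau = 0.005."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


# Placeholder dimensions; the actual values depend on the gym environment.
actor = Actor(state_dim=17, action_dim=6, max_action=1.0)
actor_target = copy.deepcopy(actor)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)  # Adam, lr = 10^-3
```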
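
The Pseudocode row points to Algorithm 1 (TD3) in the paper. The sketch below paraphrases one training iteration using the hyperparameters from the Experiment Setup row: target policy smoothing with noise ∼ N(0, 0.2) clipped to (−0.5, 0.5), clipped double-Q targets from twin critics, a mini-batch of 100 transitions, and delayed actor/target updates every d = 2 iterations. It reuses the Actor/Critic classes and the soft_update helper from the previous sketch; the replay buffer interface and the discount gamma are assumptions, and this is not the authors' released code.

```python
import copy

import torch
import torch.nn.functional as F

# Twin critics and their targets (clipped double Q-learning needs a pair).
critic1, critic2 = Critic(17, 6), Critic(17, 6)
critic1_target, critic2_target = copy.deepcopy(critic1), copy.deepcopy(critic2)
critic_opt = torch.optim.Adam(
    list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)


def td3_update(replay_buffer, iteration, gamma=0.99, d=2,
               policy_noise=0.2, noise_clip=0.5, tau=0.005, batch_size=100):
    # Hypothetical buffer returning tensors; mini-batch of 100 transitions.
    state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

    with torch.no_grad():
        # Target policy smoothing: epsilon ~ N(0, 0.2), clipped to (-0.5, 0.5).
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(
            -actor.max_action, actor.max_action)

        # Clipped double Q-learning: take the minimum of the two target critics.
        target_q = torch.min(critic1_target(next_state, next_action),
                             critic2_target(next_state, next_action))
        target_q = reward + not_done * gamma * target_q

    # Regress both critics toward the shared clipped target.
    critic_loss = (F.mse_loss(critic1(state, action), target_q) +
                   F.mse_loss(critic2(state, action), target_q))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed updates: refresh the actor and all target networks every d = 2 steps.
    if iteration % d == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        soft_update(actor_target, actor, tau)
        soft_update(critic1_target, critic1, tau)
        soft_update(critic2_target, critic2, tau)
```

Per the quoted setup, exploration during environment interaction adds Gaussian noise N(0, 0.1) to each action selected by the current policy; that interaction loop is omitted from the sketch above.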