Discount Factor as a Regularizer in Reinforcement Learning

Authors: Ron Amit, Ron Meir, Kamil Ciosek

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically study this technique compared to standard L2 regularization by extensive experiments in discrete and continuous domains, using tabular and functional representations. Our experiments suggest the regularization effectiveness is strongly related to properties of the available data, such as size, distribution, and mixing rate.
Researcher Affiliation | Collaboration | 1) The Viterbi Faculty of Electrical Engineering, Technion Israel Institute of Technology, Haifa, Israel; 2) Microsoft Research, Cambridge, UK. Correspondence to: Ron Amit <ronamit@campus.technion.ac.il>, Ron Meir <rmeir@ee.technion.ac.il>, Kamil Ciosek <Kamil.Ciosek@microsoft.com>.
Pseudocode | Yes | Algorithm 1 Generic Regularized Batch TD(0) (a hedged sketch of such a procedure is given after the table)
Open Source Code | Yes | Code for all the experiments is available at: https://github.com/ron-amit/Discount_as_Regularizer
Open Datasets | Yes | Our experiments use the Mujoco environment (Todorov et al., 2012). To test the ability to generalize from finite data, we limited the number of time-steps from the environment to 200,000 or less.
Dataset Splits | No | The paper mentions 'limited amount of training data' and 'finite data setting' and uses phrases like 'evaluation episodes' but does not specify explicit numerical training, validation, or test dataset splits (e.g., percentages or counts).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. It only mentions general settings like 'deep learning settings'.
Software Dependencies | No | The paper mentions algorithms used (e.g., TD3, DDPG, PPO, DQN, Adam) and environments (e.g., Mujoco), but it does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | All hyper-parameters are identical to those suggested by (Fujimoto et al., 2018) except the following changes. We tested with several amounts of total time-steps to simulate a limited data setting. As in Fujimoto et al. (2018), the first 10^4 time steps are used only for exploration. Another change to improve learning stability is increasing the batch size from 100 to 256. See Appendix A.7 for the complete implementation details. (These overrides are sketched below the table.)
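
The paper's Algorithm 1 is titled "Generic Regularized Batch TD(0)". The snippet below is a minimal tabular sketch of such a procedure, not the authors' implementation: the function name, the transition format, and the per-sweep semi-gradient update are assumptions, while the two knobs it exposes, a reduced discount factor and an explicit L2 penalty, are the regularizers the paper compares.

    import numpy as np

    def regularized_batch_td0(transitions, n_states, gamma, l2_coef=0.0,
                              lr=0.1, n_sweeps=200):
        """Tabular batch TD(0) value estimation with two optional regularizers:
        a reduced discount factor (gamma) and an explicit L2 penalty (l2_coef).

        transitions: iterable of (s, r, s_next, done) tuples collected under
                     the policy being evaluated (integer states).
        """
        v = np.zeros(n_states)
        for _ in range(n_sweeps):
            for s, r, s_next, done in transitions:
                # TD(0) target; evaluating with a gamma smaller than the task's
                # true discount is the implicit regularizer the paper studies.
                target = r + (0.0 if done else gamma * v[s_next])
                td_error = target - v[s]
                # Semi-gradient step on the squared TD error plus an L2 penalty
                # on the current estimate (the baseline regularizer).
                v[s] += lr * (td_error - l2_coef * v[s])
        return v

For instance, regularized_batch_td0(data, n_states=10, gamma=0.9) evaluates with a discount lower than a task discount of, say, 0.99, whereas regularized_batch_td0(data, n_states=10, gamma=0.99, l2_coef=1e-2) keeps the task discount and applies the explicit L2 baseline instead.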
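
The Experiment Setup row lists only the deviations from the TD3 defaults of Fujimoto et al. (2018). A minimal sketch of those overrides as a configuration dictionary, with hypothetical key names (the authors' exact configuration is in their Appendix A.7 and the linked repository):

    # Illustrative key names; values are the deviations quoted above.
    td3_overrides = {
        "max_total_timesteps": 200_000,   # capped at 200,000 or fewer to simulate limited data
        "exploration_timesteps": 10_000,  # first 10^4 steps are exploration-only
        "batch_size": 256,                # raised from the default 100 for learning stability
    }
    # All other hyper-parameters follow Fujimoto et al. (2018).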