Improving Policy Gradient by Exploring Under-appreciated Rewards

Authors: Ofir Nachum, Mohammad Norouzi, Dale Schuurmans

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. The proposed algorithm successfully solves a benchmark multi-digit addition task and generalizes to long sequences, which, to our knowledge, is the first time that a pure RL method has solved addition using only reward feedback.
Researcher Affiliation | Collaboration | Ofir Nachum, Mohammad Norouzi, Dale Schuurmans, Google Brain, {ofirnachum, mnorouzi, schuurmans}@google.com. Dale Schuurmans is also at the Department of Computing Science, University of Alberta (daes@ualberta.ca).
Pseudocode | No | The paper describes algorithms conceptually but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper cites "Open AI Gym" (gym.openai.com, github.com/openai/gym) as the source of the algorithmic tasks and their environments, but it does not provide any concrete access information (e.g., a specific repository link or an explicit statement of code release) for the methodology (UREX) described in this paper.
Open Datasets | Yes | We assess the effectiveness of the proposed approach on five algorithmic tasks from the Open AI Gym (Brockman et al., 2016), as well as a new binary search problem. Each task is summarized below, with further details available on the Gym website (gym.openai.com) or in the corresponding open-source code (github.com/openai/gym). (A minimal environment-loading sketch appears after the table.)
Dataset Splits | No | The paper mentions training jobs, hyper-parameter tuning, and evaluation on longer sequences, but it does not provide specific details on training, validation, and test dataset splits (e.g., percentages or sample counts) needed to reproduce the data partitioning.
Hardware Specification | No | The paper states that "Experiments are conducted using TensorFlow" and describes the model architecture, but it does not provide any specific hardware details (e.g., CPU/GPU models, memory, or number of processors) used for running its experiments.
Software Dependencies | No | The paper mentions "TensorFlow (Abadi et al., 2016)" but does not specify its version number or any other software dependencies with their specific versions.
Experiment Setup | Yes | We explore the following main hyper-parameters: The learning rate, denoted η, chosen from a set of 3 possible values η ∈ {0.1, 0.01, 0.001}. The maximum L2 norm of the gradients, beyond which the gradients are clipped. This parameter, denoted c, matters for training RNNs; the value of c is selected from c ∈ {1, 10, 40, 100}. The temperature parameter τ that controls the degree of exploration for both MENT and UREX. For MENT, we use τ ∈ {0, 0.005, 0.01, 0.1}. For UREX, we only consider τ = 0.1, which consistently performs well across the tasks. [...] The training jobs for Copy, Duplicated Input, Repeat Copy, Reverse, Reversed Addition, and Binary Search are run for 2K, 500, 50K, 5K, 50K, and 2K stochastic gradient steps, respectively. [...] Our policy network comprises a single LSTM layer with 128 nodes. We use the Adam optimizer (Kingma & Ba, 2015) for the experiments. [...] In all of the tasks except Copy, our stochastic optimizer uses mini-batches comprising 400 policy samples from the model. These 400 samples correspond to 40 different random sequences drawn from the environment, and 10 random policy trajectories per sequence. In other words, we set K = 10 and N = 40 as defined in (3) and (12). (A configuration sketch collecting these values appears after the table.)
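
The Open Datasets row points to the Gym algorithmic suite; the minimal sketch below shows one way to instantiate those five environments and roll out a random policy. It is an illustration under stated assumptions, not the authors' code: it presumes an older gym release (before 0.26) that still registers the algorithmic environment family and uses the classic reset()/step() API, and the paper's new binary search task is not part of Gym.

```python
import gym  # assumes gym < 0.26, which still registers the "algorithmic" environments

TASKS = [
    "Copy-v0",             # copy the input tape to the output tape
    "DuplicatedInput-v0",  # copy the input while removing duplicated symbols
    "RepeatCopy-v0",       # output the input forward, backward, then forward again
    "Reverse-v0",          # output the input tape in reverse order
    "ReversedAddition-v0", # multi-digit addition of base-3 numbers on a grid
]

for task in TASKS:
    env = gym.make(task)
    obs = env.reset()                  # integer id of the symbol under the read head
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()        # (move, write?, symbol) tuple
        obs, reward, done, _ = env.step(action)   # classic 4-tuple step API
        total_reward += reward
    print(f"{task}: random-policy episode reward = {total_reward}")
    env.close()
```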
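
The Experiment Setup row lists the full hyper-parameter grid and training schedule; the sketch below simply collects those numbers into one place. The container names (HPARAM_GRID, TRAIN_STEPS, BATCH, POLICY) are illustrative placeholders rather than names from the authors' code; the values themselves come from the quoted text.

```python
import itertools

HPARAM_GRID = {
    "learning_rate": [0.1, 0.01, 0.001],          # eta
    "max_grad_norm": [1, 10, 40, 100],            # gradient-clipping threshold c
    "temperature_ment": [0.0, 0.005, 0.01, 0.1],  # tau values swept for MENT
    "temperature_urex": [0.1],                    # tau fixed to 0.1 for UREX
}

TRAIN_STEPS = {            # stochastic gradient steps per task
    "Copy": 2_000,
    "DuplicatedInput": 500,
    "RepeatCopy": 50_000,
    "Reverse": 5_000,
    "ReversedAddition": 50_000,
    "BinarySearch": 2_000,
}

BATCH = {                  # 400 policy samples = N sequences x K rollouts each (all tasks except Copy)
    "num_input_sequences_N": 40,
    "rollouts_per_sequence_K": 10,
}

POLICY = {"rnn_cell": "LSTM", "num_units": 128, "optimizer": "Adam"}

# Enumerate the (learning rate, clipping norm) grid for a UREX sweep at tau = 0.1.
for lr, clip in itertools.product(HPARAM_GRID["learning_rate"],
                                  HPARAM_GRID["max_grad_norm"]):
    print(f"UREX run: learning_rate={lr}, max_grad_norm={clip}, tau=0.1")
```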