Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces

Authors: Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate the effect of key degrees of freedom and show that the algorithm performs well in illustrative domains compared to baselines. We provide a comprehensive analysis of the new algorithm from both theoretical and empirical perspectives. (Section 6: Experiments)
Researcher Affiliation | Industry | Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow (Google Research, Brain Team)
Pseudocode | Yes | Algorithm 1: Direct Policy Gradient (General Form); Algorithm 2: Top-Down Sampling a_opt
Open Source Code | No | The paper does not provide any explicit statements or links indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | MiniGrid is a partially observable grid-world where the agent observes an egocentric 7x7 grid around its current location and has the choice of 7 actions, including moving right, left, forward, or toggling doors. We use environments of 25x25 grids with a series of 6 connected rooms separated by doors that need to be opened. [8] Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: First steps towards grounded language learning with a human in the loop. In International Conference on Learning Representations, 2019. (See the environment sketch below the table.)
Dataset Splits | No | The paper describes experimental settings and training procedures (e.g., 'train policies with a range of values for 400,000 episodes') but does not specify dataset splits (e.g., percentages or counts for training, validation, or test sets).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | In all of the methods, we utilized the simulator to reset the environment so that multiple trajectories could be sampled starting from the same environment seed. In all cases, we use a total of 3000 interactions per environment seed (episode). To compute a_dir, we give a budget of 100 interactions and use priority G̃(a) in the search, enabling the early-termination option in Algorithm 1. In our method, we use 100 interactions to sample a_opt (the trajectory length) and 2900 interactions to search for a_dir. In REINFORCE and in the cross-entropy method we sample 30 independent trajectories, each 100 interactions long. We explore variations on how to set the priority of nodes in the search for a_dir. First, in the Gumbel-only priority, we use just G̃(R) as a region's priority. In the others, we use G̃(R; S, g) + ε(L(R) + U(R)), where U is based on the Manhattan distance to the goal and the number of unopened doors. Setting ε > 0 trades off enumerating in descending order of G̃(R; S, g) with favoring prefixes that have already achieved high return. Setting ε = 1 yields A* search.
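
Illustrative note on the search priorities: the experiment-setup row above describes ordering search nodes (regions R) either by the Gumbel value G̃(R) alone or by G̃(R; S, g) + ε(L(R) + U(R)), with U built from the Manhattan distance to the goal and the count of unopened doors. The sketch below shows one way such a priority could be computed; the SearchNode structure, heuristic constants, and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class SearchNode:
    """One region R in the search for a_dir (illustrative, not the authors' code)."""
    gumbel: float           # G~(R; S, g): Gumbel-perturbed value of the region
    return_so_far: float    # L(R): return already collected along the action prefix
    agent_pos: tuple        # current (x, y) position of the agent
    goal_pos: tuple         # (x, y) position of the goal square
    unopened_doors: int     # doors still closed on the way to the goal

def upper_bound(node: SearchNode,
                max_future_reward: float = 1.0,
                per_step_cost: float = 0.01,
                per_door_cost: float = 0.05) -> float:
    """U(R): optimistic estimate of the return still achievable from this prefix,
    built from the Manhattan distance to the goal and the number of unopened doors.
    The reward constants are assumptions for illustration only."""
    manhattan = (abs(node.agent_pos[0] - node.goal_pos[0])
                 + abs(node.agent_pos[1] - node.goal_pos[1]))
    return max_future_reward - per_step_cost * manhattan - per_door_cost * node.unopened_doors

def priority(node: SearchNode, eps: float, gumbel_only: bool = False) -> float:
    """Priority used to pop regions from the max-queue while searching for a_dir:
    Gumbel-only variant:  G~(R)
    heuristic variants:   G~(R; S, g) + eps * (L(R) + U(R))
    """
    if gumbel_only:
        return node.gumbel
    return node.gumbel + eps * (node.return_so_far + upper_bound(node))
```

A max-priority queue keyed on this value, popped until the per-episode interaction budget runs out (100 interactions to sample a_opt and 2900 to search for a_dir, versus 30 trajectories of 100 interactions for REINFORCE and the cross-entropy method), would mirror the budget accounting quoted above.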
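
Illustrative note on the environment: the MiniGrid setup quoted in the Open Datasets row (egocentric 7x7 observation, 7 discrete actions, 25x25 grids with 6 connected rooms) matches the open-source gym-minigrid/BabyAI environments. The sketch below shows one way to instantiate such an environment and pin it to a fixed seed so that multiple trajectories start from the same episode, as the experiment setup describes; the environment ID and the legacy gym seeding API are assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming the legacy gym + gym_minigrid API; the exact environment
# used in the paper is not stated, so the 6-room MultiRoom layout is a guess.
import gym
import gym_minigrid  # noqa: F401  (importing registers the MiniGrid-* env IDs)

ENV_ID = "MiniGrid-MultiRoom-N6-v0"  # assumption: 6 connected rooms with doors
ENV_SEED = 0                         # one "environment seed" = one episode layout

def fresh_env(seed: int = ENV_SEED) -> gym.Env:
    """Create the environment and pin its layout to a fixed seed."""
    env = gym.make(ENV_ID)
    env.seed(seed)   # legacy gym API: reset() then regenerates the same grid
    return env

env = fresh_env()
obs = env.reset()
# Egocentric partial view: a 7x7 grid of (object, colour, state) codes.
print(obs["image"].shape)   # (7, 7, 3)
# 7 discrete actions: left, right, forward, pickup, drop, toggle, done.
print(env.action_space)     # Discrete(7)
```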