Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces
Authors: Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate the effect of key degrees of freedom and show that the algorithm performs well in illustrative domains compared to baselines. We provide a comprehensive analysis of the new algorithm from both theoretical and empirical perspectives. Section 6: Experiments |
| Researcher Affiliation | Industry | Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow (Google Research, Brain Team) |
| Pseudocode | Yes | Algorithm 1 Direct Policy Gradient (General Form); Algorithm 2 Top-Down Sampling. (A hedged single-step sketch of the update appears below the table.) |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | MiniGrid is a partially observable grid-world where the agent observes an egocentric 7×7 grid around its current location and has the choice of 7 actions including moving right, left, forward, or toggling doors. We use environments of 25×25 grids with a series of 6 connected rooms separated by doors that need to be opened. [8] Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: First steps towards grounded language learning with a human in the loop. In International Conference on Learning Representations, 2019. (A hedged environment-instantiation sketch appears below the table.) |
| Dataset Splits | No | The paper describes experimental settings and training procedures (e.g., 'train policies with a range of values for 400,000 episodes') but does not specify dataset splits (e.g., percentages or counts for training, validation, or test sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., specific GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In all of the methods we utilized the simulator to reset the environment so that multiple trajectories could be sampled starting from the same environment seed. In all cases, we use a total of 3000 interactions per environment seed (episode). To compute a_dir, we give a budget of 100 interactions and use priority G(a) in the search, enabling the early termination option in Algorithm 1. In our method, we use 100 interactions to sample a_opt (the trajectory length) and 2900 interactions to search for a_dir. In REINFORCE and in the cross-entropy method we sample 30 independent trajectories, where each is 100 interactions long. We explore variations on how to set the priority of nodes in the search for a_dir. First, in the Gumbel-only priority, we use just G(R) as a region's priority. In the others, we use G(R; S, g) + (L(R) + U(R)), where U is based on the Manhattan distance to the goal and the number of unopened doors. Setting = 0 trades off enumerating by descending order of G(R; S, g) with favoring prefixes that have already achieved high return. Setting = 1 yields A* search. (A hedged best-first-search sketch of this priority scheme appears below the table.) |
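
The pseudocode row above points to Algorithm 1 (Direct Policy Gradient, General Form) and Algorithm 2 (Top-Down Sampling). As a rough illustration of the idea for a single-step categorical policy only: sample a_opt by perturbing the logits with Gumbel noise, find a_dir by adding a scaled reward to the same perturbed logits, and nudge the policy toward a_dir and away from a_opt. The `epsilon` reward scaling, the 1/epsilon step scaling, and the toy reward below follow the direct-loss-minimization convention and are assumptions here, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def direct_policy_gradient_step(logits, reward_fn, epsilon=1.0, lr=0.1):
    """Hedged single-step sketch: move the policy toward a_dir, away from a_opt."""
    n = len(logits)
    gumbels = rng.gumbel(size=n)          # one shared Gumbel perturbation

    # a_opt: a sample from the current policy via the Gumbel-max trick.
    a_opt = int(np.argmax(logits + gumbels))

    # a_dir: argmax of the same perturbed logits with a scaled reward added
    # (in the paper this argmax is found by search; here we simply enumerate).
    rewards = np.array([reward_fn(a) for a in range(n)])
    a_dir = int(np.argmax(logits + gumbels + epsilon * rewards))

    # Update direction grad_theta[log p(a_dir) - log p(a_opt)]; for a plain
    # categorical policy the softmax terms cancel, leaving a one-hot difference.
    grad = np.eye(n)[a_dir] - np.eye(n)[a_opt]
    return logits + lr * grad / epsilon

# Toy usage: 4 actions, action 2 pays off, so its logit should grow.
logits = np.zeros(4)
for _ in range(200):
    logits = direct_policy_gradient_step(logits, lambda a: 1.0 if a == 2 else 0.0)
print(np.round(logits, 2))
```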
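
The experiment-setup row describes the search for a_dir: regions (action-sequence prefixes) are expanded in order of a priority combining the region's Gumbel value G(R; S, g) with a lower bound L(R) (return already achieved) and an upper bound U(R) (e.g., Manhattan distance to the goal and unopened doors), under a fixed interaction budget with an early-termination option. The sketch below shows only a generic best-first loop under those assumptions; the callables `expand`, `priority`, `objective`, the `good_enough` threshold, and the budget accounting are hypothetical stand-ins, not the paper's exact procedure.

```python
import heapq

def best_first_search(root, expand, priority, objective,
                      interaction_budget=100, good_enough=None):
    """Hedged sketch of a best-first search for a_dir over action prefixes.

    priority(R)  orders the frontier, e.g. G(R; S, g) + c * (L(R) + U(R)),
                 where c = 0 recovers the 'Gumbel only' ordering.
    objective(R) scores candidates for a_dir, e.g. the Gumbel-perturbed
                 model score of R plus its scaled return.
    """
    frontier = [(-priority(root), 0, root)]          # heapq is a min-heap
    best, best_score, interactions, tie = root, objective(root), 0, 1

    while frontier and interactions < interaction_budget:
        _, _, region = heapq.heappop(frontier)       # highest-priority region
        if objective(region) > best_score:
            best, best_score = region, objective(region)
        if good_enough is not None and best_score >= good_enough:
            break                                     # early-termination option
        for child in expand(region):                  # one simulator step each
            interactions += 1
            heapq.heappush(frontier, (-priority(child), tie, child))
            tie += 1
    return best, best_score
```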
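
The open-datasets row quotes the MiniGrid setup (egocentric 7×7 observations, 25×25 grids with 6 connected rooms separated by doors). As a rough pointer only: the gym-minigrid / BabyAI codebase of Chevalier-Boisvert et al. ships registered multi-room environments. The environment ID, the older Gym reset/step signatures, and the claim that `MiniGrid-MultiRoom-N6-v0` approximates the paper's exact layout are all assumptions in this sketch.

```python
import gym
import gym_minigrid  # registers the MiniGrid-* environment IDs (assumed install)

# Assumption: the MultiRoom-N6 task (six connected rooms with doors) is a
# reasonable stand-in for the environment described in the paper.
env = gym.make("MiniGrid-MultiRoom-N6-v0")

obs = env.reset()              # older Gym API: reset() returns the observation only
print(env.action_space)        # Discrete(7): left, right, forward, toggle, ...
print(obs["image"].shape)      # egocentric view, (7, 7, 3) with the default view size

done = False
while not done:
    action = env.action_space.sample()           # random policy placeholder
    obs, reward, done, info = env.step(action)   # older 4-tuple step API
```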