Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces

Authors: Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate the effect of key degrees of freedom and show that the algorithm performs well in illustrative domains compared to baselines. We provide a comprehensive analysis of the new algorithm from both theoretical and empirical perspectives. (Section 6: Experiments)
Researcher Affiliation | Industry | Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow (Google Research, Brain Team)
Pseudocode | Yes | Algorithm 1: Direct Policy Gradient (General Form); Algorithm 2: Top-Down Sampling a_opt
Open Source Code | No | The paper does not provide any explicit statements or links indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | MiniGrid is a partially observable grid-world where the agent observes an egocentric 7x7 grid around its current location and has the choice of 7 actions, including moving right, left, forward, or toggling doors. We use environments of 25x25 grids with a series of 6 connected rooms separated by doors that need to be opened. [8] Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: First steps towards grounded language learning with a human in the loop. In International Conference on Learning Representations, 2019. (See the environment sketch below the table.)
Dataset Splits | No | The paper describes experimental settings and training procedures (e.g., 'train policies with a range of values for 400,000 episodes') but does not specify dataset splits (e.g., percentages or counts for training, validation, or test sets).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | In all of the methods, we utilized the simulator to reset the environment so that multiple trajectories could be sampled starting from the same environment seed. In all cases, we use a total of 3000 interactions per environment seed (episode). To compute a_dir, we give a budget of 100 interactions and use priority G̃(a) in the search, enabling the early-termination option in Algorithm 1. In our method, we use 100 interactions to sample a_opt (the trajectory length) and 2900 interactions to search for a_dir. In REINFORCE and in the cross-entropy method we sample 30 independent trajectories, each 100 interactions long. We explore variations on how to set the priority of nodes in the search for a_dir. First, in the Gumbel-only priority, we use just G̃(R) as a region's priority. In the others, we use G̃(R; S, g) + ε(L(R) + U(R)), where U is based on the Manhattan distance to the goal and the number of unopened doors. Setting ε > 0 trades off enumerating in descending order of G̃(R; S, g) with favoring prefixes that have already achieved high return. Setting ε = 1 yields A* search.
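
Illustrative note on the search priorities: the experiment-setup row above describes ordering search nodes (regions R) either by the Gumbel value G̃(R) alone or by G̃(R; S, g) + ε(L(R) + U(R)), with U built from the Manhattan distance to the goal and the count of unopened doors. The sketch below shows one way such a priority could be computed; the SearchNode structure, heuristic constants, and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class SearchNode:
    """One region R in the search for a_dir (illustrative, not the authors' code)."""
    gumbel: float           # G~(R; S, g): Gumbel-perturbed value of the region
    return_so_far: float    # L(R): return already collected along the action prefix
    agent_pos: tuple        # current (x, y) position of the agent
    goal_pos: tuple         # (x, y) position of the goal square
    unopened_doors: int     # doors still closed on the way to the goal

def upper_bound(node: SearchNode,
                max_future_reward: float = 1.0,
                per_step_cost: float = 0.01,
                per_door_cost: float = 0.05) -> float:
    """U(R): optimistic estimate of the return still achievable from this prefix,
    built from the Manhattan distance to the goal and the number of unopened doors.
    The reward constants are assumptions for illustration only."""
    manhattan = (abs(node.agent_pos[0] - node.goal_pos[0])
                 + abs(node.agent_pos[1] - node.goal_pos[1]))
    return max_future_reward - per_step_cost * manhattan - per_door_cost * node.unopened_doors

def priority(node: SearchNode, eps: float, gumbel_only: bool = False) -> float:
    """Priority used to pop regions from the max-queue while searching for a_dir:
    Gumbel-only variant:  G~(R)
    heuristic variants:   G~(R; S, g) + eps * (L(R) + U(R))
    """
    if gumbel_only:
        return node.gumbel
    return node.gumbel + eps * (node.return_so_far + upper_bound(node))
```

A max-priority queue keyed on this value, popped until the per-episode interaction budget runs out (100 interactions to sample a_opt and 2900 to search for a_dir, versus 30 trajectories of 100 interactions for REINFORCE and the cross-entropy method), would mirror the budget accounting quoted above.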
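
Illustrative note on the environment: the MiniGrid setup quoted in the Open Datasets row (egocentric 7x7 observation, 7 discrete actions, 25x25 grids with 6 connected rooms) matches the open-source gym-minigrid/BabyAI environments. The sketch below shows one way to instantiate such an environment and pin it to a fixed seed so that multiple trajectories start from the same episode, as the experiment setup describes; the environment ID and the legacy gym seeding API are assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming the legacy gym + gym_minigrid API; the exact environment
# used in the paper is not stated, so the 6-room MultiRoom layout is a guess.
import gym
import gym_minigrid  # noqa: F401  (importing registers the MiniGrid-* env IDs)

ENV_ID = "MiniGrid-MultiRoom-N6-v0"  # assumption: 6 connected rooms with doors
ENV_SEED = 0                         # one "environment seed" = one episode layout

def fresh_env(seed: int = ENV_SEED) -> gym.Env:
    """Create the environment and pin its layout to a fixed seed."""
    env = gym.make(ENV_ID)
    env.seed(seed)   # legacy gym API: reset() then regenerates the same grid
    return env

env = fresh_env()
obs = env.reset()
# Egocentric partial view: a 7x7 grid of (object, colour, state) codes.
print(obs["image"].shape)   # (7, 7, 3)
# 7 discrete actions: left, right, forward, pickup, drop, toggle, done.
print(env.action_space)     # Discrete(7)
```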