Constrained Reinforcement Learning Has Zero Duality Gap

Authors: Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, Alejandro Ribeiro

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we include a numerical example in order to showcase the consequences of our theoretical results. As an illustrative example, we consider a gridworld navigation scenario. This scenario, illustrated in Figure 1, consists of an agent attempting to navigate from a starting position to a goal. [...] We train the agent via Algorithm 1 and plot in Fig. 2 the resulting normalized duality gap. We consider two cases: an inexact primal maximization via policy gradient and an exact primal maximization. [...] We show that by solving Step 4 of Algorithm 1 exactly the duality gap effectively vanishes (red curve). [...] Figure 3 displays the effect of using coarser parametrizations: as the parametrization becomes coarser, the duality gap increases (as per Theorem 2).
Researcher Affiliation | Academia | Santiago Paternain, Luiz F. O. Chamon, Miguel Calvo-Fullana and Alejandro Ribeiro, Electrical and Systems Engineering, University of Pennsylvania, {spater,luizf,cfullana,aribeiro}@seas.upenn.edu
Pseudocode | Yes | Algorithm 1 (Dual Descent). 1: Initialize θ_0 = 0, λ_0 = 0. 2: for k = 0, 1, . . . 3: Compute an approximation of θ_{k+1} = argmax_θ L_θ(θ, λ_k) with an RL algorithm. 4: Compute the dual ascent step λ_{k+1} = [λ_k − η (V(θ_{k+1}) − s)]_+. 5: end. (A runnable sketch of this loop follows the table.)
Open Source Code | No | The paper does not include any statement about releasing source code or provide links to a code repository.
Open Datasets | No | The paper uses a custom “gridworld navigation scenario” for its numerical example, but does not provide access information (link, citation, or repository) for a publicly available or open dataset.
Dataset Splits | No | The paper describes a gridworld scenario and the training process but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using “policy gradient” or “actor-critic methods” as algorithms but does not provide specific software names with version numbers for implementation dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | No | The paper describes the agent’s policy type (softmax with table-lookup) and the constraint applied (“not cross the unsafe bridge with 99% probability”), but does not provide specific numerical hyperparameters (e.g., learning rates, batch sizes, number of epochs, optimizer details) or other explicit system-level training configurations. While a step-size η is mentioned for Algorithm 1, its specific value is not provided. (A sketch of the stated policy class and constraint follows the table.)
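The pseudocode row above quotes Algorithm 1 (dual descent) from the paper. Below is a minimal Python sketch of that loop, assuming caller-supplied routines for the approximate primal maximization and for estimating the constraint values V(θ); the step size η, the iteration count, and all function names are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of Algorithm 1 (dual descent). The primal maximization and the
# estimator of V(theta) are assumed to be supplied by the caller; the step size
# eta and iteration count are illustrative, since the paper does not report them.
import numpy as np

def dual_descent(primal_argmax, estimate_values, s, eta=0.1, num_iters=200):
    """Run lambda_{k+1} = [lambda_k - eta * (V(theta_{k+1}) - s)]_+ .

    primal_argmax(lmbda)   -> theta approximately maximizing L(theta, lmbda)
                              (Step 3, e.g. via a policy-gradient routine).
    estimate_values(theta) -> array of constraint values V(theta).
    s                      -> array of thresholds; the constraint is V(theta) >= s.
    """
    s = np.asarray(s, dtype=float)
    lmbda = np.zeros_like(s)                 # Step 1: lambda_0 = 0
    theta = None
    for _ in range(num_iters):               # Step 2: for k = 0, 1, ...
        theta = primal_argmax(lmbda)         # Step 3: inexact primal maximization
        v = estimate_values(theta)           # estimate V(theta_{k+1})
        lmbda = np.maximum(lmbda - eta * (v - s), 0.0)  # Step 4: projected dual step
    return theta, lmbda
```

Plugging an exact maximization oracle into primal_argmax in place of a policy-gradient step corresponds to the "exact primal maximization" case whose duality gap the paper reports as effectively vanishing.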
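The experiment-setup row notes that the paper specifies only the policy class (softmax with table lookup) and the safety constraint (avoid the unsafe bridge with probability at least 0.99). The sketch below illustrates what such a parametrization and a Monte Carlo constraint estimate could look like; the grid size, action set, and rollout interface are assumptions, since the paper does not report them.

```python
# Illustrative softmax table-lookup policy and Monte Carlo safety estimate.
# Grid size, action count, and the rollout hook are assumptions; only the
# policy class and the 99% safety threshold come from the paper.
import numpy as np

N_STATES, N_ACTIONS = 25, 4     # assumed 5x5 grid with 4 moves (not from the paper)
SAFETY_THRESHOLD = 0.99         # constraint: P(never cross the unsafe bridge) >= 0.99

def softmax_policy(theta, state):
    """Table lookup: theta holds one logit per (state, action) pair."""
    logits = theta[state]                 # theta has shape (N_STATES, N_ACTIONS)
    z = np.exp(logits - logits.max())     # subtract max for numerical stability
    return z / z.sum()

def estimate_safety(theta, rollout_is_safe, num_episodes=1000):
    """Estimate P(trajectory avoids the bridge) by sampling.

    rollout_is_safe(policy) -> True if a sampled trajectory never crosses the
    bridge; this environment hook is hypothetical.
    """
    policy = lambda s: softmax_policy(theta, s)
    safe = sum(rollout_is_safe(policy) for _ in range(num_episodes))
    return safe / num_episodes
```

In the dual-descent sketch above, this estimate would supply one entry of V(θ), with the corresponding threshold s set to 0.99.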