Submodular Reinforcement Learning

Authors: Manish Prajapat, Mojmír Mutný, Melanie N. Zeilinger, Andreas Krause

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase the versatility of our approach by applying SUBPO to several applications such as biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization. Our results demonstrate sample efficiency, as well as scalability to high-dimensional state-action spaces.
Researcher Affiliation | Academia | Manish Prajapat (ETH Zurich), Mojmír Mutný (ETH Zurich), Melanie N. Zeilinger (ETH Zurich), Andreas Krause (ETH Zurich)
Pseudocode | Yes | Algorithm 1: Submodular Policy Optimization (SUBPO)
Open Source Code | Yes | Code available at https://github.com/manish-pra/non-additive-RL
Open Datasets | Yes | We simulate a bio-diversity monitoring task, where we aim to cover areas with a high density of gorilla nests with a quadrotor in the Kagwene Gorilla Sanctuary (Fig. 1a). ... Let ρ : V → ℝ be the nest density obtained by fitting a smooth rate function (Mutný & Krause, 2021) over Gorilla nest counts (Funwi-gabga & Mateu, 2011). ... For instances where we utilized randomly sampled environments, such as coverage with GP samples, gorilla nest density, or item collection environment, we have included the corresponding environment files in the attached code for easy reference.
Dataset Splits | No | The paper mentions running experiments, epochs, and multiple runs but does not specify clear training, validation, or test dataset splits with percentages or counts.
Hardware Specification | No | This takes roughly 1 hour of training for a single-core CPU. ... This takes roughly 6 hours of training for a single-core CPU. The paper mentions 'single-core CPU' but lacks specific details such as the processor model, manufacturer, or clock speed needed to reproduce the timings.
Software Dependencies | No | We implemented all algorithms in Pytorch and will make the code and the videos public. The paper mentions 'Pytorch' but does not specify a version number or other software dependencies with their versions.
Experiment Setup | Yes | The agent's policy was parameterized by a two-layer multi-layer perceptron, consisting of 64 neurons in each layer. The non-linearity in the network was induced by employing the Rectified Linear Unit (ReLU) activation function. By employing a stochastic policy, the agent generated a categorical distribution over the action set for each state. Subsequently, this distribution was passed through a softmax probability function. We employed a batch size of B = 500 and a low entropy coefficient of α = 0 or 0.005, depending on the specific characteristics of the environment.
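
The experiment-setup row above pins down the policy architecture (two hidden layers of 64 ReLU units, categorical/softmax output over a discrete action set). The following is a minimal PyTorch sketch reconstructed from that description, not an excerpt of the released code; the class name PolicyNet, state_dim, and num_actions are illustrative placeholders, and states are assumed to be encoded as feature vectors (e.g. one-hot).

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Two-layer MLP policy: 64 ReLU units per hidden layer, stochastic
    categorical output over a discrete action set (softmax over logits)."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # unnormalized action logits
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        logits = self.net(state)
        # Categorical(logits=...) applies the softmax internally, giving the
        # "categorical distribution over the action set" described above.
        return Categorical(logits=logits)
```

In use, an action would be drawn as `dist = policy(state_features); action = dist.sample()`.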
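The table also reports the optimization settings (batch size B = 500, entropy coefficient α ∈ {0, 0.005}) and names Algorithm 1 (SUBPO) without reproducing it. The sketch below is only a generic, hedged illustration of training such a policy on a submodular trajectory objective: each step is credited with the marginal gain of a set function F over the visited states, and a REINFORCE-style update with an entropy bonus is applied. The env.reset()/env.step() interface, the one-hot state encoding, the horizon, and reading B as a number of episodes are all assumptions, not details taken from the paper or its code.

```python
import torch


def one_hot(state_idx: int, num_states: int) -> torch.Tensor:
    """Illustrative state encoding: a one-hot vector over a finite state set."""
    x = torch.zeros(num_states)
    x[state_idx] = 1.0
    return x


def training_step(policy, optimizer, env, F, num_states,
                  batch_size=500, alpha=0.005, horizon=40):
    """One REINFORCE-style update on a submodular trajectory objective.

    Assumptions (not from the paper's code): env.reset()/env.step() return
    integer state indices, F maps a set of visited states to a scalar, the
    horizon of 40 is a placeholder, and B = 500 is read as a number of
    sampled episodes. Each step is rewarded with the marginal gain of F.
    """
    episode_losses = []
    for _ in range(batch_size):
        state = env.reset()
        visited = {state}
        value = F(visited)
        log_probs, entropies, gains = [], [], []
        for _ in range(horizon):
            dist = policy(one_hot(state, num_states))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            entropies.append(dist.entropy())
            state = env.step(action.item())
            visited.add(state)
            new_value = F(visited)
            gains.append(new_value - value)  # marginal gain of F at this step
            value = new_value
        # Credit each action with the gains collected from that step onward.
        returns = torch.tensor(gains).flip(0).cumsum(0).flip(0)
        episode_losses.append(
            -(torch.stack(log_probs) * returns).sum()
            - alpha * torch.stack(entropies).sum()  # entropy bonus, alpha in {0, 0.005}
        )
    loss = torch.stack(episode_losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

An optimizer such as torch.optim.Adam(policy.parameters()) would be constructed once outside the loop; the paper's Algorithm 1 should be consulted for the exact objective and update it prescribes.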