Submodular Reinforcement Learning
Authors: Manish Prajapat, Mojmír Mutný, Melanie N. Zeilinger, Andreas Krause
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the versatility of our approach by applying SUBPO to several applications such as biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization. Our results demonstrate sample efficiency, as well as scalability to high-dimensional state-action spaces. |
| Researcher Affiliation | Academia | Manish Prajapat (ETH Zurich); Mojmír Mutný (ETH Zurich); Melanie N. Zeilinger (ETH Zurich); Andreas Krause (ETH Zurich) |
| Pseudocode | Yes | Algorithm 1 Submodular Policy Optimization (SUBPO) |
| Open Source Code | Yes | Code available at https://github.com/manish-pra/non-additive-RL |
| Open Datasets | Yes | We simulate a bio-diversity monitoring task, where we aim to cover areas with a high density of gorilla nests with a quadrotor in the Kagwene Gorilla Sanctuary (Fig. 1a). ... Let ρ : V → ℝ be the nest density obtained by fitting a smooth rate function (Mutný & Krause, 2021) over gorilla nest counts (Funwi-gabga & Mateu, 2011). ... For instances where we utilized randomly sampled environments, such as coverage with GP samples, gorilla nest density, or item collection environment, we have included the corresponding environment files in the attached code for easy reference. |
| Dataset Splits | No | The paper mentions running experiments, epochs, and multiple runs but does not specify clear training, validation, or test dataset splits with percentages or counts. |
| Hardware Specification | No | This takes roughly 1 hour of training for a single-core CPU. ... This takes roughly 6 hours of training for a single-core CPU. The paper mentions 'single-core CPU' but lacks specific details like model number, manufacturer, or clock speed to enable reproduction. |
| Software Dependencies | No | We implemented all algorithms in Pytorch and will make the code and the videos public. The paper mentions 'Pytorch' but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | Yes | The agent's policy was parameterized by a two-layer multi-layer perceptron, consisting of 64 neurons in each layer. The non-linearity in the network was induced by employing the Rectified Linear Unit (ReLU) activation function. By employing a stochastic policy, the agent produced, for each state, a categorical distribution over the action set obtained by passing the network outputs through a softmax function. We employed a batch size of B = 500 and a low entropy coefficient of α = 0 or 0.005, depending on the specific characteristics of the environment. |
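
The Experiment Setup row above describes the policy architecture in enough detail to sketch it. The snippet below is a minimal, hedged PyTorch reconstruction rather than code from the authors' repository: the class and argument names (`MLPPolicy`, `obs_dim`, `n_actions`) are illustrative assumptions, and only the two 64-unit hidden layers, ReLU activations, softmax/categorical action distribution, and entropy coefficient α ∈ {0, 0.005} come from the description in the table.

```python
# Hedged sketch of the policy described in the Experiment Setup row:
# a two-layer MLP (64 units per hidden layer, ReLU) whose outputs define a
# softmax/categorical distribution over a discrete action set.
# obs_dim and n_actions are placeholders, not values from the SUBPO codebase.
import torch
import torch.nn as nn
from torch.distributions import Categorical


class MLPPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # logits over the action set
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        logits = self.net(obs)
        # Categorical applies a softmax to the logits internally, giving the
        # stochastic categorical action distribution described in the paper.
        return Categorical(logits=logits)


# Usage sketch: sample an action and keep quantities a policy-gradient
# update would need, including an optional entropy bonus with alpha.
policy = MLPPolicy(obs_dim=2, n_actions=4)
dist = policy(torch.zeros(1, 2))
action = dist.sample()
log_prob = dist.log_prob(action)
entropy_bonus = 0.005 * dist.entropy()  # alpha in {0, 0.005} per the paper
```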