Autoregressive Policy Optimization for Constrained Allocation Tasks

Authors: David Winkel, Niklas Strauß, Maximilian Bernhard, Zongyue Li, Thomas Seidl, Matthias Schubert

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark."
Researcher Affiliation | Academia | David Winkel, Niklas Strauß, Maximilian Bernhard, Zongyue Li, Thomas Seidl, Matthias Schubert; Munich Center for Machine Learning, LMU Munich; {winkel,strauss,bernhard,li,seidl,schubert}@dbs.ifi.lmu.de
Pseudocode | Yes | "Algorithm 1: Maximum likelihood estimation of parameter de-biasing terms"
Open Source Code | Yes | "Our code is available at: https://github.com/niklasdbs/paspo."
Open Datasets | Yes | "We use the environment of [27]... The financial market trajectories in this environment are sampled from a hidden Markov model, which was fitted based on real-world NASDAQ-100 data... The environment is based on the paper of [3]."
Dataset Splits | No | The paper describes evaluation during training (e.g., "After every 5120 environment steps, we run eight parallel evaluations on 200 fixed trajectories") but does not specify a distinct validation split, in percentages or counts, for hyperparameter tuning separate from a final test set.
Hardware Specification | Yes | "Given the relatively small network sizes, training is conducted exclusively on CPUs. We used an internal CPU cluster with consumer machines and servers ranging from 8 to 90 cores and RAM between 32GB and 512GB."
Software Dependencies | No | The paper states "We implement our algorithm and the baselines using RLlib and PyTorch" but does not provide version numbers for these software components.
Experiment Setup | Yes | "In Table 5 we list the most important parameters and hyperparameters... We use a fully-connected MLP with two hidden layers of 32 units and ReLU non-linearities for each policy, cost, and value function."
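The network described in the experiment setup (two hidden layers of 32 units with ReLU) can be sketched as follows. This is a minimal stdlib-only illustration, not the authors' implementation (the paper uses PyTorch and RLlib); the input/output dimensions and the initialization scheme below are assumptions for demonstration only.

```python
import math
import random

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, weights, biases):
    # weights: one row of input weights per output unit
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def make_layer(n_in, n_out, rng):
    # Uniform init bounded by fan-in -- an assumption; the paper
    # does not state its initialization scheme.
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def mlp_forward(x, layers):
    # Two hidden layers with ReLU non-linearities, linear output head,
    # matching the architecture quoted from the paper.
    *hidden, head = layers
    for w, b in hidden:
        x = relu(linear(x, w, b))
    return linear(x, *head)

rng = random.Random(0)
obs_dim, out_dim = 8, 4  # hypothetical dimensions
layers = [make_layer(obs_dim, 32, rng),   # input -> hidden 1 (32 units)
          make_layer(32, 32, rng),        # hidden 1 -> hidden 2 (32 units)
          make_layer(32, out_dim, rng)]   # hidden 2 -> output head
y = mlp_forward([0.1] * obs_dim, layers)
print(len(y))  # 4
```

In the paper's setting, a separate instance of this small network would serve as each of the policy, cost, and value functions.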