Autoregressive Policy Optimization for Constrained Allocation Tasks
Authors: David Winkel, Niklas Strauß, Maximilian Bernhard, Zongyue Li, Thomas Seidl, Matthias Schubert
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark. |
| Researcher Affiliation | Academia | David Winkel, Niklas Strauß, Maximilian Bernhard, Zongyue Li, Thomas Seidl, Matthias Schubert; Munich Center for Machine Learning, LMU Munich; {winkel,strauss,bernhard,li,seidl,schubert}@dbs.ifi.lmu.de |
| Pseudocode | Yes | Algorithm 1 Maximum likelihood estimation of parameter de-biasing terms |
| Open Source Code | Yes | Our code is available at: https://github.com/niklasdbs/paspo. |
| Open Datasets | Yes | We use the environment of [27]...The financial market trajectories in this environment are sampled from a hidden Markov model, which was fitted based on real-world NASDAQ-100 data...The environment is based on the paper of [3]. |
| Dataset Splits | No | The paper describes evaluation procedures during training (e.g., 'After every 5120 environment steps, we run eight parallel evaluations on 200 fixed trajectories') but does not specify a distinct validation dataset split in percentages or counts typically used for hyperparameter tuning, separate from a final test set. |
| Hardware Specification | Yes | Given the relatively small network sizes, training is conducted exclusively on CPUs. We used an internal CPU cluster with consumer machines and servers ranging from 8 to 90 cores and RAM between 32GB and 512GB. |
| Software Dependencies | No | The paper states 'We implement our algorithm and the baselines using RLlib and PyTorch' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In Table 5 we list the most important parameters and hyperparameters... We use a fully-connected MLP with two hidden layers of 32 units and ReLU non-linearities for each policy, cost, and value function. |
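
To give a sense of how small the network described under Experiment Setup is, the following is a minimal, dependency-free sketch of a fully connected MLP with two hidden layers of 32 units and ReLU activations. It is not the authors' implementation (the paper reports using RLlib and PyTorch). The input and output sizes (`OBS_DIM`, `OUT_DIM`), the initialization scheme, and all function names are assumptions made for illustration only.

```python
import random

def init_mlp(in_dim, out_dim, hidden=(32, 32), seed=0):
    """Create weights for a fully connected MLP as a list of (W, b) pairs.

    The uniform initialization used here is an illustrative assumption.
    """
    rng = random.Random(seed)
    sizes = [in_dim, *hidden, out_dim]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / fan_in ** 0.5
        W = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
             for _ in range(fan_out)]
        b = [0.0] * fan_out
        layers.append((W, b))
    return layers

def forward(layers, x):
    """Apply the MLP: ReLU after each hidden layer, linear output layer."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bias
             for row, bias in zip(W, b)]
        if i < len(layers) - 1:  # hidden layers only
            x = [max(0.0, v) for v in x]
    return x

def num_params(layers):
    """Count all weights and biases."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)

# Hypothetical sizes: an 8-dimensional observation and a scalar value output.
OBS_DIM, OUT_DIM = 8, 1
value_net = init_mlp(OBS_DIM, OUT_DIM)
print(num_params(value_net))
```

With these assumed sizes the network has 1,377 parameters (288 + 1,056 + 33), which is consistent with the paper's remark that the networks are small enough to train on CPUs.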