Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Constructing an Optimal Behavior Basis for the Option Keyboard
Authors: Lucas N. Alegre, Ana Bazzan, Andre Barreto, Bruno Silva
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate our method in challenging high-dimensional RL problems and show that it consistently outperforms state-of-the-art GPI-based approaches. Importantly, we also observe that the performance gain over competing methods becomes more pronounced as the number of reward features increases. (Abstract) |
| Researcher Affiliation | Collaboration | Lucas N. Alegre Institute of Informatics Federal University of Rio Grande do Sul Porto Alegre, RS, Brazil EMAIL Ana L. C. Bazzan Institute of Informatics Federal University of Rio Grande do Sul Porto Alegre, RS, Brazil EMAIL André Barreto Google DeepMind London, UK EMAIL Bruno C. da Silva University of Massachusetts Amherst, MA, USA EMAIL |
| Pseudocode | Yes | Algorithm 1: Option Keyboard Basis (OKB) [...] Algorithm 2: OK Linear Support (OK-LS) [...] Algorithm 3: Train Option Keyboard (Train OK) |
| Open Source Code | Yes | All the code required to reproduce our experiments is available in the Supplemental Material. |
| Open Datasets | Yes | Figure 1: Domains used in the experiments: Minecart, Fetch Pick And Place, Item Collection, and Highway. [...] We used the implementation available on MO-Gymnasium (Felten et al., 2023). [...] Our implementation of this domain is an adaptation of the one available in Gymnasium-Robotics (de Lazcano et al., 2023). [...] This domain is based on the autonomous driving environment introduced by Leurent (2018). |
| Dataset Splits | Yes | To generate test task sets W for different values of d, we used the method introduced by Takagi et al. (2020), which produces uniformly spaced weight vectors in W. |
| Hardware Specification | Yes | All experiments were performed in a cluster with NVIDIA A100-PCIE-40GB GPUs with 32GB of RAM. |
| Software Dependencies | No | We used Adam (Kingma and Ba, 2015) as the first-order optimizer used to train all neural networks with mini-batches of size 256. [...] We used pycddlib (https://github.com/mcmtroffaes/pycddlib) implementation of the Double Description Method (Motzkin et al., 1953). |
| Experiment Setup | Yes | The USFAs ψ(s, a, w) used for encoding the base policies Π_k were modeled with multi-layer perceptron (MLP) neural networks with 4 layers with 256 neurons. [...] The meta-policy ω(s, w) was modeled with an MLP with 3 layers with 256 neurons. [...] We used Adam (Kingma and Ba, 2015) as the first-order optimizer used to train all neural networks with mini-batches of size 256. [...] The budget of environment interactions per iteration (i.e., call to New Policy in Alg. 1) used was 25000, 50000, 50000 and 100000 for the Minecart, Fetch Pick And Place, Item Collection, and Highway domains, respectively. |
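
The Experiment Setup row can be captured as a small configuration record, which may help when cross-checking a reimplementation against the reported values. The sketch below is not the authors' code: the names `ReportedSetup`, `mlp_layer_shapes`, and `count_parameters` are hypothetical, reading "4 layers" and "3 layers" as hidden-layer counts is an assumption, and the input and output dimensions in the usage line are placeholders, since the table does not report them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    usfa_layers: int = 4        # USFA MLP: "4 layers with 256 neurons"
    meta_policy_layers: int = 3  # meta-policy MLP: "3 layers with 256 neurons"
    width: int = 256
    batch_size: int = 256
    optimizer: str = "Adam"
    # Environment-interaction budget per iteration, by domain.
    budget_per_iteration: dict = field(default_factory=lambda: {
        "Minecart": 25_000,
        "Fetch Pick And Place": 50_000,
        "Item Collection": 50_000,
        "Highway": 100_000,
    })


def mlp_layer_shapes(in_dim, out_dim, n_hidden, width):
    """Return (fan_in, fan_out) pairs for an MLP with n_hidden hidden layers.

    Assumption: the reported layer count refers to hidden layers only.
    """
    dims = [in_dim] + [width] * n_hidden + [out_dim]
    return list(zip(dims[:-1], dims[1:]))


def count_parameters(shapes):
    """Count weights plus biases for fully connected layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


setup = ReportedSetup()
# in_dim=10 and out_dim=3 are placeholder sizes, not values from the paper.
usfa_shapes = mlp_layer_shapes(10, 3, setup.usfa_layers, setup.width)
```

With the placeholder dimensions replaced by a domain's actual observation and feature sizes, `count_parameters(usfa_shapes)` gives a quick sanity check on model size under the hidden-layer reading.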