Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
$q$-exponential family for policy optimization
Authors: Lingwei Zhu, Haseeb Shah, Han Wang, Yukie Nagai, Martha White
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide comprehensive experiments on both online and offline problems showing that q-exponential family policies can improve on the Gaussian by a large margin. In particular, we find that the Student's t policy is more stable, performing well across algorithms and problems, shown in Figure 2. We ran experiments with different algorithms, to get a better sense of how conclusions about policy parameterization vary across different actor-critic algorithms. |
| Researcher Affiliation | Academia | Lingwei Zhu University of Tokyo EMAIL Haseeb Shah University of Alberta EMAIL Han Wang University of Alberta EMAIL Yukie Nagai University of Tokyo Martha White University of Alberta |
| Pseudocode | Yes | Algorithm 1: q-Gaussian sampling Algorithm 2: Out-of-support action handling for the light-tailed q-Gaussian |
| Open Source Code | Yes | Our code is available at https://github.com/lingweizhu/qexp. |
| Open Datasets | Yes | We used the standard benchmark MuJoCo suite from D4RL to evaluate algorithm-policy combinations (Fu et al., 2020). The D4RL offline datasets all contain 1 million samples generated by a partially trained SAC agent. |
| Dataset Splits | No | The paper describes the composition of the D4RL datasets (Medium-Replay, Medium, Medium-Expert) and how many samples they contain, but does not specify explicit train/test/validation splits for its own experiments beyond using these named datasets as distinct experimental settings. For online experiments, it details evaluation procedures (e.g., averaging over 3 episodes or 1 episode) rather than dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' but does not specify a version number for PyTorch or any other software component used in the experiments. |
| Experiment Setup | Yes | D.2 ONLINE EXPERIMENTS: We used a 2-layer network with 64 nodes on each layer and ReLU non-linearities. The batch size was 32. Agents used a target network for the critic, updated with polyak averaging with α = 0.01. Table 4: Default hyperparameters and sweeping choices for online experiments. D.3 OFFLINE EXPERIMENTS: We used a 2-layer network with 256 nodes on each layer. The batch size was 256. Agents used a target network for the critic, updated with polyak averaging with α = 0.005. The discount rate was set to 0.99. Table 5: Default hyperparameters and sweeping choices for offline experiments. |
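
The Experiment Setup row reports that the critic's target network is updated with polyak averaging (α = 0.01 online, α = 0.005 offline). The sketch below illustrates one common form of that update in plain Python. It is not taken from the paper's repository: the function name `polyak_update`, the dictionary-of-lists parameter layout, the `ONLINE` and `OFFLINE` settings dictionaries, and the convention that α weights the online parameters are all assumptions made for illustration.

```python
# Reported settings from the Experiment Setup row; the dictionary layout is hypothetical.
ONLINE = {"hidden_layers": 2, "hidden_units": 64, "batch_size": 32, "alpha": 0.01}
OFFLINE = {"hidden_layers": 2, "hidden_units": 256, "batch_size": 256,
           "alpha": 0.005, "discount": 0.99}


def polyak_update(target, online, alpha):
    """Return new target parameters: (1 - alpha) * target + alpha * online.

    Assumption: alpha is the weight on the online network, so a small alpha
    makes the target network track the online critic slowly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if target.keys() != online.keys():
        raise ValueError("target and online parameter names differ")

    updated = {}
    for name, target_values in target.items():
        online_values = online[name]
        if len(target_values) != len(online_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of old target and current online values.
        updated[name] = [
            (1.0 - alpha) * t + alpha * o
            for t, o in zip(target_values, online_values)
        ]
    return updated


if __name__ == "__main__":
    # Toy parameters standing in for critic weights.
    target_params = {"w": [0.0, 1.0], "b": [0.5]}
    online_params = {"w": [1.0, 1.0], "b": [-0.5]}
    print(polyak_update(target_params, online_params, ONLINE["alpha"]))
```

With α this small, each call moves the target only a short distance toward the online critic, which is the usual motivation for a slowly moving target network in actor-critic methods.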