Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization
Authors: Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang, Ti-Rong Wu, I-Chen Wu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ. |
| Researcher Affiliation | Academia | 1Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan 2Institute of Information Science, Academia Sinica, Taiwan 3Research Center for Information Technology Innovation, Academia Sinica, Taiwan |
| Pseudocode | No | The paper describes the MAQ framework and its training process using textual descriptions and a high-level diagram (Figure 1), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Justification: We have provided all necessary information for reproducing our main results. Once the paper is accepted, we will release all codes and trained models. |
| Open Datasets | Yes | We evaluate MAQ on one of the standard RL benchmarks, the Adroit tasks in D4RL [16], which has readily available human demonstrations. |
| Dataset Splits | Yes | Each task includes 25 human demonstration trajectories, which are split into training and testing datasets with a 9:1 ratio. Agents that require human demonstrations, including BC, IQL, and all MAQ-based agents, are trained on the same training datasets. ... Each task consists of 25 successful human demonstrations, which we split into training and testing sets in a 9:1 ratio, resulting in 22 trajectories for training and 3 for testing. |
| Hardware Specification | Yes | We conducted our experiments on a machine equipped with an E5-2678 CPU and four NVIDIA GeForce GTX 1080 Ti GPUs. |
| Software Dependencies | No | For SAC, we used the default hyperparameters from Stable-Baselines3 [33]. For IQL [29], we adopted the PyTorch implementation by gwthomas's repository released on GitHub [34], which most closely follows the original paper. For RLPD [6], we used the official implementation released on GitHub. |
| Experiment Setup | Yes | All agents (including baseline and MAQ-based agents) are trained for a total of 10^6 steps. For IQL and MAQ+IQL, which include both offline and online training stages, we train 10^6 steps to each stage. For MAQ-based agents, we first train a VQVAE on human demonstrations to obtain the macro action codebook. The VQVAE is trained with β = 0.25 in Eq. (4), a hidden size of 256, a batch size of 32, and 100 training episodes. The same trained VQVAE is shared across all MAQ-based agents to ensure consistency. Other detailed hyperparameters are provided in Appendix A. ... Hyperparameters: All hyperparameters used in the online RL methods are detailed in Table 2, and those for Conditional VQVAE are listed in Table 3. |
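
The Dataset Splits row reports that 25 demonstrations split 9:1 yield 22 training and 3 testing trajectories. A strict 9:1 split of 25 would be 22.5, so the reported counts are consistent with rounding the training share down. The sketch below illustrates that arithmetic only; the function name, the shuffling, the seed, and the floor rounding are assumptions for illustration and are not taken from the paper.

```python
import random


def split_trajectories(trajectories, train_parts=9, total_parts=10, seed=0):
    """Hypothetical helper: shuffle, then split train:test as 9:1.

    The training count is rounded down (assumption), so 25 items
    give 25 * 9 // 10 = 22 for training and the remaining 3 for testing.
    """
    items = list(trajectories)
    random.Random(seed).shuffle(items)  # seeded for a repeatable split
    n_train = len(items) * train_parts // total_parts  # integer floor
    return items[:n_train], items[n_train:]


# Placeholder identifiers standing in for the 25 human demonstrations.
demos = [f"demo_{i:02d}" for i in range(25)]
train, test = split_trajectories(demos)
print(len(train), len(test))
```

With 25 placeholder items, the two returned lists have 22 and 3 elements, matching the counts quoted in the table.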