Mimicking Better by Matching the Approximate Action Distribution

Authors: Joao Candido Ramos, Lionel Blondé, Naoya Takeishi, Alexandros Kalousis

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
Researcher Affiliation | Academia | 1 University of Geneva (UNIGE), Switzerland; 2 University of Applied Sciences and Arts Western Switzerland (HES-SO); 3 The University of Tokyo, Japan; 4 RIKEN Center for Advanced Intelligence Project, Japan.
Pseudocode | Yes | Algorithm 1: Mimicking Better by Matching the Approximate Action Distribution (MAAD).
Open Source Code | Yes | Our code is openly available: https://github.com/jacr13/MAAD
Open Datasets | Yes | We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We collected expert trajectories from a policy trained using PPO (Schulman et al., 2017) on each MuJoCo task. Then we used the collected trajectories to train several imitation learning baseline models and compare them against different flavors of our model. Table 2 provides a description of the state and action spaces of the MuJoCo environments, along with the number and length of expert trajectories used to train our models. (An illustrative trajectory-collection sketch appears after this table.)
Dataset Splits | No | The paper does not explicitly provide information on validation dataset splits, such as percentages or sample counts for a distinct validation set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud computing instances) used for running the experiments.
Software Dependencies | No | We implemented all the algorithms investigated and reported in PyTorch, maintaining a similar structure and keeping the same hyperparameters as much as possible. We used PPO (Schulman et al., 2017) as the underlying reinforcement learning algorithm. No version numbers are provided for PyTorch or other libraries/frameworks.
Experiment Setup | Yes | Table 3 provides a comprehensive list of the hyperparameters used for each of the evaluated algorithms in Section 5. Shared parameters: batch size 64; rollout length 2048; discount γ 0.99; π architecture {MLP [128, 128], MLP [256, 256]}; π learning rate 10^-4; π updates {3, 6, 9}; PPO ϵ {0.1, 0.2}.
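
For convenience, the shared hyperparameters quoted in the Experiment Setup row can be written as a configuration dictionary. This is a minimal sketch rather than the authors' code: the key names are our own, and only the values come from the quoted Table 3 excerpt.

```python
# Minimal sketch (not the authors' code): shared hyperparameters as quoted
# from Table 3 of the paper. Key names are assumptions; values are quoted.
SHARED_HPARAMS = {
    "batch_size": 64,
    "rollout_length": 2048,
    "discount_gamma": 0.99,
    "policy_architectures": [[128, 128], [256, 256]],  # MLP hidden sizes (swept)
    "policy_learning_rate": 1e-4,
    "policy_updates": [3, 6, 9],                        # swept values
    "ppo_clip_epsilon": [0.1, 0.2],                     # swept values
}
```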
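The Open Datasets row states that expert trajectories were collected from a policy trained with PPO on MuJoCo tasks. The sketch below shows one plausible way to gather such trajectories; it is not the authors' pipeline, and the Gymnasium API, the `HalfCheetah-v4` environment id, and the `expert_policy` callable are assumptions.

```python
# Minimal sketch (not the authors' pipeline): roll out a trained policy in a
# MuJoCo Gym task and store (state, action) pairs per trajectory.
import gymnasium as gym
import numpy as np

def collect_trajectories(expert_policy, env_id="HalfCheetah-v4",
                         num_trajectories=16, max_steps=1000):
    """Return a list of {'states', 'actions'} arrays from expert rollouts."""
    env = gym.make(env_id)
    trajectories = []
    for _ in range(num_trajectories):
        obs, _ = env.reset()
        states, actions = [], []
        for _ in range(max_steps):
            action = expert_policy(obs)  # hypothetical trained PPO policy
            states.append(obs)
            actions.append(action)
            obs, _, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                break
        trajectories.append({"states": np.asarray(states),
                             "actions": np.asarray(actions)})
    env.close()
    return trajectories
```

In practice, `expert_policy` would wrap the trained PPO actor (e.g., returning its mean action), and the resulting trajectories would then be fed to the imitation learning baselines and to MAAD as described in the paper.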