Learning non-Markovian Decision-Making from State-only Sequences

Authors: Aoyang Qin, Feng Gao, Qing Li, Song-Chun Zhu, Sirui Xie

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of the proposed method in a prototypical path planning task with non-Markovian constraints and show that the learned model exhibits strong performances in challenging domains from the MuJoCo suite. (...) We also test the proposed modeling, learning, and computing method in MuJoCo, a domain with higher-dimensional state and action spaces, and achieve performance competitive to existing methods, even those that learn with action labels.
Researcher Affiliation | Academia | Aoyang Qin (1,2), Feng Gao (3), Qing Li (2), Song-Chun Zhu (1,2,4), Sirui Xie (5); 1: Department of Automation, Tsinghua University; 2: Beijing Institute for General Artificial Intelligence (BIGAI); 3: Department of Statistics, UCLA; 4: School of Artificial Intelligence, Peking University; 5: Department of Computer Science, UCLA
Pseudocode | Yes | Algorithm 1: LanMDP without importance sampling (...) Algorithm 2: LanMDP with importance sampling
Open Source Code | Yes | Code and data are available at https://github.com/qayqaq/LanMDP
Open Datasets | No | The paper uses MuJoCo control tasks (CartPole-v1, Reacher-v2, Swimmer-v3, Hopper-v2, Walker2d-v2) and generates its own demonstrations: 'We train an expert for each task using PPO [45]. They are then used to generate 10 trajectories for each task as demonstrations.' However, it does not provide access information (link, DOI, or citation with author/year) for these generated demonstrations. (A hedged sketch of this generation recipe follows the table.)
Dataset Splits | No | The paper states 'Results for context length 1 are illustrated through learning curves and a bar plot in Fig. 4. These learning curves are the average progress across 5 seeds.' and 'We report the mean of the best performance achieved by BC/BCO with five random seeds'. This establishes seed-level averaging for robustness, but explicit train/validation/test splits (e.g., percentages, sample counts, or references to predefined splits with citations) are not provided for reproducibility.
Hardware Specification | Yes | All benchmarking is performed using a single 3090Ti GPU and implemented using the PyTorch framework.
Software Dependencies | Yes | All benchmarking is performed using a single 3090Ti GPU and implemented using the PyTorch framework. (...) We leverage PPO [45] approach to train the expert policy...
Experiment Setup | Yes | Hyper-parameters are listed in Table 3. (...) The input and output dimensions are adapted to the state and action spaces in different tasks, and so are short-run sampling steps. Sequential contexts are extracted from stored episodic memory. The number of neurons in the input and hidden layer in the policy MLP varies according to the context length. We use replay buffers to store the self-interaction experiences for training the transition model offline. See Appendix D for detailed information on network architectures and hyper-parameters.
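
For context, the demonstration-generation recipe quoted in the Open Datasets row (train a PPO expert per task, then roll out 10 trajectories and keep only the states) could look roughly like the sketch below. This is not the authors' script: it assumes the stable-baselines3 PPO implementation and Gymnasium-style environment handling, and the training budget is an illustrative placeholder.

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

# Tasks named in the report; the v2/v3 IDs come from older Gym releases,
# so the exact IDs may need updating for a current Gymnasium install.
TASKS = ["CartPole-v1", "Reacher-v2", "Swimmer-v3", "Hopper-v2", "Walker2d-v2"]


def generate_state_only_demos(task_id, n_traj=10, train_steps=1_000_000):
    """Train a PPO expert on one task, then record state-only rollouts."""
    env = gym.make(task_id)
    expert = PPO("MlpPolicy", env, verbose=0)
    expert.learn(total_timesteps=train_steps)  # training budget is a placeholder

    demos = []
    for _ in range(n_traj):  # the paper generates 10 demonstration trajectories per task
        obs, _ = env.reset()
        states = [obs]
        done = False
        while not done:
            action, _ = expert.predict(obs, deterministic=True)
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            states.append(obs)  # actions are discarded: demonstrations are state-only
        demos.append(np.stack(states))
    return demos
```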
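
Likewise, the Experiment Setup note that the policy MLP's input and hidden widths vary with the context length suggests a layout along these lines. The class name, layer widths, and scaling rule below are assumptions for illustration only; the actual architectures and hyper-parameters are given in the paper's Appendix D.

```python
import torch
import torch.nn as nn


class ContextPolicy(nn.Module):
    """Policy MLP whose input is a concatenation of the last `context_len` states.

    The hidden width is scaled with the context length, mirroring the statement
    that the number of neurons in the input and hidden layers varies with the
    context length. Sizes and the scaling rule are illustrative, not the paper's.
    """

    def __init__(self, state_dim, action_dim, context_len, hidden_per_step=64):
        super().__init__()
        in_dim = state_dim * context_len        # flat sequential context
        hidden = hidden_per_step * context_len  # assumed scaling rule
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state_context):
        # state_context: (batch, context_len, state_dim) -> (batch, context_len * state_dim)
        return self.net(state_context.flatten(start_dim=1))


if __name__ == "__main__":
    policy = ContextPolicy(state_dim=11, action_dim=3, context_len=4)
    dummy_context = torch.randn(8, 4, 11)  # batch of 8 contexts of 4 states each
    print(policy(dummy_context).shape)     # -> torch.Size([8, 3])
```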