A Tractable Inference Perspective of Offline RL

Authors: Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, Trifle achieves 7 state-of-the-art scores and the highest average scores in 9 Gym-MuJoCo benchmarks against strong baselines. Further, Trifle significantly outperforms prior approaches in stochastic environments and safe RL tasks with minimum algorithmic modifications.
Researcher Affiliation | Academia | 1 Institute for Artificial Intelligence, Peking University; 2 Computer Science Department, University of California, Los Angeles; 3 School of Intelligence Science and Technology, Peking University
Pseudocode | Yes | The main algorithm is illustrated in Algorithm 1, where we take the current state st as well as the past trajectory τ<t as input, utilize the specified value estimate fv as a heuristic to guide beam search, and output the best trajectory. (Referring to Algorithm 1)
Open Source Code | Yes | Our code is available at https://github.com/liebenxj/Trifle.git
Open Datasets | Yes | Empirically, Trifle achieves 7 state-of-the-art scores and the highest average scores in 9 Gym-MuJoCo benchmarks against strong baselines. Further, Trifle significantly outperforms prior approaches in stochastic environments and safe RL tasks with minimum algorithmic modifications.
Dataset Splits | No | The paper refers to a 'training phase' and an 'evaluation phase' for the model, but does not explicitly detail a data split for validation (e.g., a percentage or count).
Hardware Specification | No | The paper mentions training 'on one GPU' but does not specify the make, model, or any other hardware details (e.g., 'It only takes 30-60 minutes (~20s per epoch, 100-200 epochs) to train a PC on one GPU').
Software Dependencies | No | The paper mentions software components such as 'GPTs', 'diffusion models', 'Trajectory Transformer', 'Decision Transformer', and 'BERT-like Transformer', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Beam search maintains a set of N (incomplete) sequences, each starting as an empty sequence. For ease of presentation, we assume the current time step is 0. At every time step t, beam search replicates each of the N action sequences into λ ∈ ℤ+ copies and appends an action at to every sequence. Specifically, for every partial action sequence a<t, we sample an action following p(at | s0, a<t, E[Vt] ≥ v), where Vt can be either the single-step or the multi-step estimate depending on the task. Now that we have λ·N trajectories in total, the next step is to evaluate their expected return, which can be computed exactly using the PC (see Sec. 4.2). The N best action sequences are kept and proceed to the next time step. After repeating this procedure for H time steps, we return the best action sequence. The first action in the sequence is used to interact with the environment. Please refer to Appx. C for detailed descriptions of the algorithm.
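The quoted beam-search procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: `beam_search_sketch`, `sample_action`, and `expected_return` are hypothetical names, and the sampler and evaluator passed in are toy stand-ins for Trifle's PC-guided action proposal p(at | s0, a<t, E[Vt] ≥ v) and its exact expected-return computation.

```python
import random

def beam_search_sketch(sample_action, expected_return,
                       n_beams=4, expansion=3, horizon=5, seed=0):
    """Value-guided beam search over action sequences (illustrative sketch).

    sample_action(prefix, rng) and expected_return(sequence) are toy
    stand-ins for the PC-based components described in the paper.
    """
    rng = random.Random(seed)
    beams = [[]]  # N (incomplete) sequences, each starting as an empty sequence
    for _ in range(horizon):  # repeat for H time steps
        candidates = []
        for seq in beams:  # replicate each sequence into `expansion` copies
            for _ in range(expansion):
                candidates.append(seq + [sample_action(seq, rng)])
        # evaluate the expected return of all candidates, keep the N best
        candidates.sort(key=expected_return, reverse=True)
        beams = candidates[:n_beams]
    best = max(beams, key=expected_return)
    # the first action of the best sequence is executed in the environment
    return best[0], best
```

With a toy binary action space and sum-of-actions as the "return", the search greedily accumulates high-value actions while the beam width bounds the number of sequences kept per step.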