Bridging State and History Representations: Understanding Self-Predictive RL

Authors: Tianwei Ni, Benjamin Eysenbach, Erfan SeyedSalehi, Michel Ma, Clement Gehring, Aditya Mahajan, Pierre-Luc Bacon

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experimentation across three benchmarks (Sec. 5), we provide empirical evidence substantiating all our theoretical predictions using our simple algorithm.
Researcher Affiliation | Collaboration | Mila, Université de Montréal; Princeton University; Mila, McGill University. {tianwei.ni, michel.ma, clement.gehring, pierre-luc.bacon}@mila.quebec, eysenbach@princeton.edu, erfan.seyedsalehi@mail.mcgill.ca, aditya.mahajan@mcgill.ca
Pseudocode | Yes | Algo. 1 provides the pseudocode for the update rule of all parameters in our algorithm given a tuple of transition data, with PyTorch code included in Appendix.
Open Source Code | Yes | Algo. 1 provides the pseudocode for the update rule of all parameters in our algorithm given a tuple of transition data, with PyTorch code included in Appendix.
Open Datasets | Yes | We conduct experiments to compare RL agents learning the three representations {φQ, φL, φO}, respectively. To decouple representation learning from policy optimization, we follow our minimalist algorithm (Algo. 1) to learn φL, and instantiate φQ and φO by setting λ = 0 and replacing ZP loss with OP loss, as we discuss in Sec. 4.3. We evaluate the algorithms in standard MDPs, distracting MDPs, and sparse-reward POMDPs. The experimental details are shown in Sec. E.
Dataset Splits | No | We conduct experiments to compare RL agents learning the three representations {φQ, φL, φO}, respectively. To decouple representation learning from policy optimization, we follow our minimalist algorithm (Algo. 1) to learn φL, and instantiate φQ and φO by setting λ = 0 and replacing ZP loss with OP loss, as we discuss in Sec. 4.3. We evaluate the algorithms in standard MDPs, distracting MDPs, and sparse-reward POMDPs. The experimental details are shown in Sec. E.
Hardware Specification | No | This work was enabled by the computational resources provided by Calcul Québec (www.calculquebec.ca) and the Digital Research Alliance of Canada (https://alliancecan.ca/), with material support from NVIDIA Corporation.
Software Dependencies | No | Our code is written in PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | The experimental details are shown in Sec. E.
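
The Pseudocode and Open Source Code rows above point to Algo. 1, an update that combines a TD loss with a latent self-prediction (ZP) auxiliary loss weighted by λ. As a reading aid only, below is a minimal PyTorch sketch of such an update on one batch of transitions. The network sizes, the EMA target encoder, the L2 form of the ZP loss, and the DQN-style TD target are assumptions on our part rather than the authors' released implementation; setting lam to 0 drops the ZP term, mirroring the φQ instantiation mentioned in the Open Datasets row.

```python
# Hedged sketch, not the authors' code: TD loss + lam * ZP (latent self-prediction) loss.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, num_actions, latent_dim = 8, 4, 32          # toy sizes (assumed)
gamma, lam, tau = 0.99, 1.0, 0.005                    # lam = 0 recovers the phi_Q variant

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
target_encoder = copy.deepcopy(encoder)               # EMA target encoder for ZP/TD targets
latent_model = nn.Sequential(                         # predicts next latent from (z, a)
    nn.Linear(latent_dim + num_actions, 64), nn.ReLU(), nn.Linear(64, latent_dim))
q_head = nn.Linear(latent_dim, num_actions)
target_q_head = copy.deepcopy(q_head)
for p in list(target_encoder.parameters()) + list(target_q_head.parameters()):
    p.requires_grad_(False)

params = list(encoder.parameters()) + list(latent_model.parameters()) + list(q_head.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)

def update(obs, action, reward, next_obs, done):
    """One gradient step on a batch of transitions: TD loss + lam * ZP loss."""
    z = encoder(obs)
    with torch.no_grad():
        z_next_target = target_encoder(next_obs)      # detached latent target

    # TD loss on the Q-head (DQN-style bootstrapped target is an assumption).
    q = q_head(z).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_head(z_next_target).max(dim=1).values
        td_target = reward + gamma * (1.0 - done) * next_q
    td_loss = F.mse_loss(q, td_target)

    # ZP loss: predict the next latent from (z, a) against the target encoder's output.
    a_onehot = F.one_hot(action, num_actions).float()
    z_next_pred = latent_model(torch.cat([z, a_onehot], dim=-1))
    zp_loss = F.mse_loss(z_next_pred, z_next_target)

    loss = td_loss + lam * zp_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Polyak-average the target networks.
    with torch.no_grad():
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.lerp_(p, tau)
        for p, tp in zip(q_head.parameters(), target_q_head.parameters()):
            tp.lerp_(p, tau)
    return td_loss.item(), zp_loss.item()

# Toy usage with random transitions.
batch = 16
td, zp = update(
    torch.randn(batch, obs_dim),
    torch.randint(0, num_actions, (batch,)),
    torch.randn(batch),
    torch.randn(batch, obs_dim),
    torch.zeros(batch),
)
print(f"td_loss={td:.3f} zp_loss={zp:.3f}")
```

Swapping the ZP term for an observation-reconstruction (OP) loss, as the Open Datasets row describes for φO, would replace latent_model with a decoder that predicts next_obs and compare against the raw observation instead of the target latent.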