Bridging State and History Representations: Understanding Self-Predictive RL

Authors: Tianwei Ni, Benjamin Eysenbach, Erfan SeyedSalehi, Michel Ma, Clement Gehring, Aditya Mahajan, Pierre-Luc Bacon

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experimentation across three benchmarks (Sec. 5), we provide empirical evidence substantiating all our theoretical predictions using our simple algorithm.
Researcher Affiliation | Collaboration | Mila, Université de Montréal; Princeton University; Mila, McGill University. {tianwei.ni, michel.ma, clement.gehring, pierre-luc.bacon}@mila.quebec, eysenbach@princeton.edu, erfan.seyedsalehi@mail.mcgill.ca, aditya.mahajan@mcgill.ca
Pseudocode | Yes | Algo. 1 provides the pseudocode for the update rule of all parameters in our algorithm given a tuple of transition data, with PyTorch code included in Appendix.
Open Source Code | Yes | Algo. 1 provides the pseudocode for the update rule of all parameters in our algorithm given a tuple of transition data, with PyTorch code included in Appendix.
Open Datasets | Yes | We conduct experiments to compare RL agents learning the three representations {φQ, φL, φO}, respectively. To decouple representation learning from policy optimization, we follow our minimalist algorithm (Algo. 1) to learn φL, and instantiate φQ and φO by setting λ = 0 and replacing ZP loss with OP loss, as we discuss in Sec. 4.3. We evaluate the algorithms in standard MDPs, distracting MDPs, and sparse-reward POMDPs. The experimental details are shown in Sec. E.
Dataset Splits | No | We conduct experiments to compare RL agents learning the three representations {φQ, φL, φO}, respectively. To decouple representation learning from policy optimization, we follow our minimalist algorithm (Algo. 1) to learn φL, and instantiate φQ and φO by setting λ = 0 and replacing ZP loss with OP loss, as we discuss in Sec. 4.3. We evaluate the algorithms in standard MDPs, distracting MDPs, and sparse-reward POMDPs. The experimental details are shown in Sec. E.
Hardware Specification | No | This work was enabled by the computational resources provided by Calcul Québec (www.calculquebec.ca) and the Digital Research Alliance of Canada (https://alliancecan.ca/), with material support from NVIDIA Corporation.
Software Dependencies | No | Our code is written in PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | The experimental details are shown in Sec. E.
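
The Pseudocode and Open Source Code rows above point to Algo. 1, an update that combines a TD loss with a latent self-prediction (ZP) auxiliary loss weighted by λ. As a reading aid only, below is a minimal PyTorch sketch of such an update on one batch of transitions. The network sizes, the EMA target encoder, the L2 form of the ZP loss, and the DQN-style TD target are assumptions on our part rather than the authors' released implementation; setting lam to 0 drops the ZP term, mirroring the φQ instantiation mentioned in the Open Datasets row.

```python
# Hedged sketch, not the authors' code: TD loss + lam * ZP (latent self-prediction) loss.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, num_actions, latent_dim = 8, 4, 32          # toy sizes (assumed)
gamma, lam, tau = 0.99, 1.0, 0.005                    # lam = 0 recovers the phi_Q variant

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
target_encoder = copy.deepcopy(encoder)               # EMA target encoder for ZP/TD targets
latent_model = nn.Sequential(                         # predicts next latent from (z, a)
    nn.Linear(latent_dim + num_actions, 64), nn.ReLU(), nn.Linear(64, latent_dim))
q_head = nn.Linear(latent_dim, num_actions)
target_q_head = copy.deepcopy(q_head)
for p in list(target_encoder.parameters()) + list(target_q_head.parameters()):
    p.requires_grad_(False)

params = list(encoder.parameters()) + list(latent_model.parameters()) + list(q_head.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)

def update(obs, action, reward, next_obs, done):
    """One gradient step on a batch of transitions: TD loss + lam * ZP loss."""
    z = encoder(obs)
    with torch.no_grad():
        z_next_target = target_encoder(next_obs)      # detached latent target

    # TD loss on the Q-head (DQN-style bootstrapped target is an assumption).
    q = q_head(z).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_head(z_next_target).max(dim=1).values
        td_target = reward + gamma * (1.0 - done) * next_q
    td_loss = F.mse_loss(q, td_target)

    # ZP loss: predict the next latent from (z, a) against the target encoder's output.
    a_onehot = F.one_hot(action, num_actions).float()
    z_next_pred = latent_model(torch.cat([z, a_onehot], dim=-1))
    zp_loss = F.mse_loss(z_next_pred, z_next_target)

    loss = td_loss + lam * zp_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Polyak-average the target networks.
    with torch.no_grad():
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.lerp_(p, tau)
        for p, tp in zip(q_head.parameters(), target_q_head.parameters()):
            tp.lerp_(p, tau)
    return td_loss.item(), zp_loss.item()

# Toy usage with random transitions.
batch = 16
td, zp = update(
    torch.randn(batch, obs_dim),
    torch.randint(0, num_actions, (batch,)),
    torch.randn(batch),
    torch.randn(batch, obs_dim),
    torch.zeros(batch),
)
print(f"td_loss={td:.3f} zp_loss={zp:.3f}")
```

Swapping the ZP term for an observation-reconstruction (OP) loss, as the Open Datasets row describes for φO, would replace latent_model with a decoder that predicts next_obs and compare against the raw observation instead of the target latent.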