Bridging State and History Representations: Understanding Self-Predictive RL
Authors: Tianwei Ni, Benjamin Eysenbach, Erfan SeyedSalehi, Michel Ma, Clement Gehring, Aditya Mahajan, Pierre-Luc Bacon
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimentation across three benchmarks (Sec. 5), we provide empirical evidence substantiating all our theoretical predictions using our simple algorithm. |
| Researcher Affiliation | Collaboration | Mila, Université de Montréal; Princeton University; Mila, McGill University. {tianwei.ni, michel.ma, clement.gehring, pierre-luc.bacon}@mila.quebec, eysenbach@princeton.edu, erfan.seyedsalehi@mail.mcgill.ca, aditya.mahajan@mcgill.ca |
| Pseudocode | Yes | Algo. 1 provides the pseudocode for the update rule of all parameters in our algorithm given a tuple of transition data, with PyTorch code included in the Appendix. |
| Open Source Code | Yes | Algo. 1 provides the pseudocode for the update rule of all parameters in our algorithm given a tuple of transition data, with PyTorch code included in the Appendix. |
| Open Datasets | Yes | We conduct experiments to compare RL agents learning the three representations {φQ, φL, φO}, respectively. To decouple representation learning from policy optimization, we follow our minimalist algorithm (Algo. 1) to learn φL, and instantiate φQ and φO by setting λ = 0 and replacing ZP loss with OP loss, as we discuss in Sec. 4.3. We evaluate the algorithms in standard MDPs, distracting MDPs, and sparse-reward POMDPs. The experimental details are shown in Sec. E. |
| Dataset Splits | No | We conduct experiments to compare RL agents learning the three representations {φQ, φL, φO}, respectively. To decouple representation learning from policy optimization, we follow our minimalist algorithm (Algo. 1) to learn φL, and instantiate φQ and φO by setting λ = 0 and replacing ZP loss with OP loss, as we discuss in Sec. 4.3. We evaluate the algorithms in standard MDPs, distracting MDPs, and sparse-reward POMDPs. The experimental details are shown in Sec. E. |
| Hardware Specification | No | This work was enabled by the computational resources provided by the Calcul Québec (www.calculquebec.ca) and the Digital Research Alliance of Canada (https://alliancecan.ca/), with material support from NVIDIA Corporation. |
| Software Dependencies | No | Our code is written in PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | The experimental details are shown in Sec. E. |
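The quoted responses describe the paper's minimalist self-predictive algorithm (Algo. 1): an encoder is trained jointly with an RL (TD) loss and a λ-weighted latent self-prediction (ZP) loss, where λ = 0 recovers the plain end-to-end representation φQ and swapping the ZP loss for an observation-prediction (OP) loss yields φO. Since the report only references the released PyTorch code (Appendix), below is a minimal sketch of one such update for the fully observed, discrete-action case; all module names, dimensions, loss choices, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a self-predictive update in the spirit of Algo. 1:
# encoder phi maps observations to latents z, a latent model predicts the
# next latent (ZP loss with a stop-gradient/EMA target), and a Q-head on the
# same latent is trained with a TD loss. Module names and sizes are made up.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, latent_dim, n_actions = 8, 32, 4
lam, gamma, tau = 1.0, 0.99, 0.005  # lam = 0 drops the ZP term (phi_Q case)

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
target_encoder = copy.deepcopy(encoder)          # EMA target for the ZP loss
latent_model = nn.Sequential(nn.Linear(latent_dim + n_actions, 64), nn.ReLU(),
                             nn.Linear(64, latent_dim))
q_head = nn.Linear(latent_dim, n_actions)
target_q_head = copy.deepcopy(q_head)
opt = torch.optim.Adam([*encoder.parameters(), *latent_model.parameters(),
                        *q_head.parameters()], lr=3e-4)

def update(obs, act, rew, next_obs, done):
    """One gradient step on a batch of transitions (obs, act, rew, next_obs, done)."""
    z = encoder(obs)
    with torch.no_grad():                        # stop-gradient on ZP and TD targets
        z_next_tgt = target_encoder(next_obs)
        q_next = target_q_head(z_next_tgt).max(dim=-1).values
        td_target = rew + gamma * (1.0 - done) * q_next
    # ZP loss: predict the next latent from (z, a)
    a_onehot = F.one_hot(act, n_actions).float()
    z_pred = latent_model(torch.cat([z, a_onehot], dim=-1))
    zp_loss = F.mse_loss(z_pred, z_next_tgt)
    # TD loss on the Q-head built on top of the same latent
    q = q_head(z).gather(1, act.unsqueeze(-1)).squeeze(-1)
    td_loss = F.mse_loss(q, td_target)
    loss = td_loss + lam * zp_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Polyak-average the target networks
    for net, tgt in [(encoder, target_encoder), (q_head, target_q_head)]:
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
    return loss.item()
```

In the sparse-reward POMDP experiments the encoder would instead be a history (e.g. recurrent) encoder, and the ZP loss could use other metrics than the ℓ2 distance assumed here; the sketch only illustrates the joint TD-plus-self-prediction update and the stop-gradient target that the quoted passages refer to.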