MICo: Improved representations via sampling-based state similarity for Markov decision processes

Authors: Pablo Samuel Castro, Tyler Kastner, Prakash Panangaden, Mark Rowland

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark." and "6 Large-scale empirical evaluation" |
| Researcher Affiliation | Collaboration | Pablo Samuel Castro (Google Research, Brain Team); Tyler Kastner (McGill University); Prakash Panangaden (McGill University); Mark Rowland (DeepMind) |
| Pseudocode | No | The paper describes algorithms mathematically (e.g., Equation 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not state that code for the described methodology is released, nor does it link to a code repository. |
| Open Datasets | Yes | "We evaluated on all 60 Atari 2600 games over 5 seeds and report the results in Figure 1 (left), using the interquartile mean (IQM), proposed by Agarwal et al. [2021b] as a more robust and reliable alternative to mean and median (which are reported in Figure 6)." and "Additionally, we evaluated the MICo loss on twelve of the DM-Control suite from pixels environments [Tassa et al., 2018]." |
| Dataset Splits | No | The paper does not provide percentages, sample counts, or any other methodology for training, validation, and test splits. |
| Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU models, CPU types, or TPU versions) used to run the experiments. |
| Software Dependencies | No | The paper mentions the "JAX agents provided in the Dopamine library" and TensorFlow without giving version numbers for these components. |
| Experiment Setup | Yes | "For all experiments we used the hyperparameter settings provided with Dopamine. We found that a value of α = 0.5 worked well with quantile-based agents (QR-DQN, IQN, and M-IQN), while a value of α = 0.01 worked well with DQN and Rainbow." and "We found it important to use the Huber loss [Huber, 1964] to minimize L_MICo as this emphasizes greater accuracy for smaller distances as opposed to larger distances." |
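The experiment-setup row mentions two concrete design choices from the paper: the MICo metric loss is minimized with a Huber loss, and it is mixed into the agent's TD loss with a weight α (0.5 for quantile-based agents, 0.01 for DQN and Rainbow). The sketch below illustrates that combination in plain NumPy; the function names (`huber`, `mico_loss`, `total_loss`) and the exact (1 − α)/α mixing form are illustrative assumptions, not code from the paper.

```python
import numpy as np

def huber(x, delta=1.0):
    # Huber loss: quadratic near zero, linear in the tails, so small
    # distance errors are penalized relatively more than large ones.
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def mico_loss(online_dist, target_dist):
    # Huber loss on the error between the online distance estimate and
    # its (stop-gradient) target, averaged over the batch of state pairs.
    return np.mean(huber(online_dist - target_dist))

def total_loss(td_loss, metric_loss, alpha=0.5):
    # Hypothetical mixture of the agent's TD loss and the MICo metric
    # loss; the paper reports alpha = 0.5 for quantile-based agents
    # (QR-DQN, IQN, M-IQN) and alpha = 0.01 for DQN and Rainbow.
    return (1.0 - alpha) * td_loss + alpha * metric_loss
```

Because the Huber loss grows only linearly beyond `delta`, it emphasizes accuracy for small distances, matching the motivation quoted from the paper.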