Learning One Representation to Optimize All Rewards

Authors: Ahmed Touati, Yann Ollivier

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove viability of the method on several environments, from mazes to pixel-based Ms. Pacman and a virtual robotic arm. For single-state rewards (learning to reach arbitrary states), we provide quantitative comparisons with goal-oriented methods such as HER. ... We run our experiments on a selection of environments that are diverse in terms of state space dimensionality, stochasticity and dynamics.
Researcher Affiliation | Collaboration | Ahmed Touati, Mila, University of Montreal (ahmed.touati@umontreal.ca); Yann Ollivier, Facebook Artificial Intelligence Research, Paris (yol@fb.com). Work done during an internship at Facebook Artificial Intelligence Research, Paris.
Pseudocode | Yes | At each step, a value of z is picked at random, together with a batch of transitions (s_t, a_t, s_{t+1}) and a batch of state-actions (s', a') from the training set, with (s', a') independent from z and (s_t, a_t, s_{t+1}). For sampling z, we use a fixed distribution (rescaled Gaussians, see Appendix D). Any number of values of z may be sampled: this does not use up training samples. We use a target network with soft updates (Polyak averaging) as in DDPG. For training we also replace the greedy policy π_z = argmax_a F(s, a, z)^T z with a regularized version π_z = softmax(F(s, a, z)^T z / τ) with fixed temperature τ (Appendix D). Since there is unidentifiability between F and B (Appendix, Remark 7), we normalize B via an auxiliary loss in Algorithm 1. [A code sketch of these training-step mechanics is given after the table.]
Open Source Code | Yes | Code: https://github.com/ahmed-touati/controllable_agent
Open Datasets | No | The paper uses environments like "Discrete Maze", "Continuous Maze", "Fetch Reach", and "Ms. Pacman". For Ms. Pacman, it mentions "A variant of the Atari 2600 game Ms. Pacman... [RUMS18]". While these are well-known environments, no specific link, DOI, or explicit access information for the *datasets* used in the experiments (e.g., collected transitions) is provided. The citations are for the environments/games, not the collected data.
Dataset Splits | No | No specific details about train/validation/test splits (e.g., percentages, sample counts, or explicit references to standard splits) are provided in the paper's main text.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions algorithms and techniques such as "DQN", "DDPG", "softmax", and "t-SNE", but does not provide specific software library names with version numbers or programming language versions used for implementation.
Experiment Setup | Yes | For all environments, we run algorithms for 800 epochs, with three different random seeds. Each epoch consists of 25 cycles where we interleave between gathering some amount of transitions, to add to the replay buffer, and performing 40 steps of stochastic gradient descent on the model parameters. To collect transitions, we generate episodes using some behavior policy. For both mazes, we use a uniform policy, while for Fetch Reach and Ms. Pacman, we use an ε-greedy policy with respect to the current approximation F(s, a, z)^T z for a sampled z. At evaluation time, ε-greedy policies are also used, with a smaller ε. More details are given in Appendix D. [A code sketch of this collection/training loop is given after the table.]
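The training-step mechanics quoted in the Pseudocode row can be illustrated with a short sketch. This is not the authors' code: the network shapes, the exact form of the rescaled-Gaussian z distribution, and the form of the B-normalization loss are assumptions made here for illustration; the paper's Algorithm 1 and Appendix D give the actual definitions. The sketch only covers the pieces named in the quote: sampling z, the softmax-regularized policy with temperature τ, Polyak (soft) target updates, and an auxiliary loss normalizing B.

```python
# Minimal sketch (not the authors' code) of the training-step mechanics quoted in the
# Pseudocode row, assuming discrete actions and hypothetical forward/backward networks.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as nnF

d, n_actions, state_dim = 32, 5, 10        # illustrative sizes, not taken from the paper
tau, polyak = 1.0, 0.005                   # assumed softmax temperature and soft-update rate

f_net = nn.Sequential(nn.Linear(state_dim + d, 128), nn.ReLU(), nn.Linear(128, n_actions * d))
b_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, d))
f_target, b_target = copy.deepcopy(f_net), copy.deepcopy(b_net)

def sample_z(batch_size):
    # "Rescaled Gaussian": here, a unit Gaussian rescaled to norm sqrt(d); the paper's
    # exact z distribution is specified in its Appendix D.
    z = torch.randn(batch_size, d)
    return (d ** 0.5) * z / z.norm(dim=-1, keepdim=True)

def policy_probs(state, z):
    # Regularized policy: softmax over actions of F(s, a, z)^T z / tau, instead of the argmax.
    fa = f_net(torch.cat([state, z], dim=-1)).view(-1, n_actions, d)
    q = (fa * z.unsqueeze(1)).sum(dim=-1)          # (batch, n_actions) values of F^T z
    return nnF.softmax(q / tau, dim=-1)

def b_normalization_loss(states):
    # Auxiliary loss pushing the empirical covariance of B towards the identity, one way to
    # remove the F/B rescaling unidentifiability; the paper's exact normalization is in Algorithm 1.
    B = b_net(states)
    cov = B.T @ B / B.shape[0]
    return ((cov - torch.eye(d)) ** 2).sum()

def soft_update(target, online):
    # Polyak averaging of target-network parameters, as in DDPG.
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.mul_(1.0 - polyak).add_(polyak * p)

# After each gradient step on the FB loss (plus b_normalization_loss), one would call:
# soft_update(f_target, f_net); soft_update(b_target, b_net)
```

The full FB Bellman loss that ties F and B together is deliberately omitted; only the surrounding machinery mentioned in the quote is shown.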
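Similarly, the loop structure described in the Experiment Setup row (800 epochs of 25 cycles, each interleaving episode collection under a uniform or ε-greedy behavior policy with 40 gradient steps) can be sketched as follows. The env, agent, and replay_buffer objects and the ε values are hypothetical placeholders; the paper's actual settings are in its Appendix D.

```python
# Minimal sketch (not the authors' code) of the experiment-loop structure described above.
import random

N_EPOCHS, N_CYCLES, N_GRAD_STEPS = 800, 25, 40
EPS_COLLECT, EPS_EVAL = 0.2, 0.02          # assumed values, not taken from the paper

def collect_episode(env, agent, replay_buffer, z, eps):
    # One episode under an epsilon-greedy behavior policy for a sampled z
    # (for the mazes the paper uses a uniform policy instead, i.e. eps = 1).
    state, done = env.reset(), False
    while not done:
        if random.random() < eps:
            action = env.action_space.sample()          # exploratory action
        else:
            action = agent.greedy_action(state, z)      # argmax_a F(s, a, z)^T z
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, next_state)
        state = next_state

def run_training(env, agent, replay_buffer):
    for epoch in range(N_EPOCHS):
        for cycle in range(N_CYCLES):
            z = agent.sample_z()                        # fixed z distribution (rescaled Gaussian)
            collect_episode(env, agent, replay_buffer, z, EPS_COLLECT)
            for _ in range(N_GRAD_STEPS):               # 40 SGD steps per cycle
                agent.gradient_step(replay_buffer.sample_batch())
```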