Learning One Representation to Optimize All Rewards
Authors: Ahmed Touati, Yann Ollivier
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove viability of the method on several environments from mazes to pixel-based Ms Pacman and a virtual robotic arm. For single-state rewards (learning to reach arbitrary states), we provide quantitative comparisons with goal-oriented methods such as HER. ... We run our experiments on a selection of environments that are diverse in terms of state space dimensionality, stochasticity and dynamics. |
| Researcher Affiliation | Collaboration | Ahmed Touati, Mila, University of Montreal (ahmed.touati@umontreal.ca); Yann Ollivier, Facebook Artificial Intelligence Research, Paris (yol@fb.com). Work done during an internship at Facebook Artificial Intelligence Research Paris. |
| Pseudocode | Yes | At each step, a value of z is picked at random, together with a batch of transitions (s₀, a₀, s₁) and a batch of state-actions (s₀, a₀) from the training set, with (s₀, a₀) independent from z and (s₀, a₀, s₁). For sampling z, we use a fixed distribution (rescaled Gaussians, see Appendix D). Any number of values of z may be sampled: this does not use up training samples. We use a target network with soft updates (Polyak averaging) as in DDPG. For training we also replace the greedy policy π_z = argmax_a F(s, a, z)ᵀz with a regularized version π_z = softmax(F(s, a, z)ᵀz / τ) with fixed temperature τ (Appendix D). Since there is unidentifiability between F and B (Appendix, Remark 7), we normalize B via an auxiliary loss in Algorithm 1. (See the sketch of these components after this table.) |
| Open Source Code | Yes | Code: https://github.com/ahmed-touati/controllable_agent |
| Open Datasets | No | The paper uses environments like "Discrete Maze", "Continuous Maze", "Fetch Reach", and "Ms. Pacman". For Ms. Pacman, it mentions "A variant of the Atari 2600 game Ms. Pacman...[RUMS18]". While these are well-known environments, no specific link, DOI, or explicit access information for the *datasets* used in the experiments (e.g., collected transitions) is provided. The citations are for the environments/games, not the collected data. |
| Dataset Splits | No | No specific details about train/validation/test splits (e.g., percentages, sample counts, or explicit references to standard splits) are provided in the paper's main text. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions algorithms and techniques such as "DQN", "DDPG", "softmax", and "t-SNE", but does not provide specific software library names with version numbers or programming language versions used for implementation. |
| Experiment Setup | Yes | For all environments, we run algorithms for 800 epochs, with three different random seeds. Each epoch consists of 25 cycles where we interleave between gathering some amount of transitions, to add to the replay buffer, and performing 40 steps of stochastic gradient descent on the model parameters. To collect transitions, we generate episodes using some behavior policy. For both mazes, we use a uniform policy while for Fetch Reach and Ms. Pacman, we use an ε-greedy policy with respect to the current approximation F(s, a, z)ᵀz for a sampled z. At evaluation time, ε-greedy policies are also used, with a smaller ε. More details are given in Appendix D. (See the schematic training loop after this table.) |
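
The Pseudocode row quotes the moving parts of the training step: z sampled from a rescaled Gaussian, a softmax-regularized policy over F(s, a, z)ᵀz with temperature τ, an auxiliary loss that normalizes B, and Polyak-averaged target networks. The PyTorch sketch below illustrates those pieces only, under assumed interfaces (a hypothetical F_net(s, z) returning per-action forward embeddings for discrete actions, and B_net(s) returning backward embeddings). The z rescaling and the exact B regularizer are guesses at what Appendix D and Algorithm 1 specify, and the full FB Bellman loss lives in the paper and the linked repository, not here.

```python
# Minimal sketch (not the authors' implementation) of the components quoted in
# the Pseudocode row: z sampling, softmax policy, B normalization, Polyak update.
import torch
import torch.nn.functional as F_nn


def sample_z(batch_size, d, scale=1.0, device="cpu"):
    """One plausible reading of "rescaled Gaussians": draw z ~ N(0, I_d) and
    rescale it to norm sqrt(d) * scale (the exact scheme is in Appendix D)."""
    z = torch.randn(batch_size, d, device=device)
    return scale * (d ** 0.5) * F_nn.normalize(z, dim=-1)


def soft_policy_probs(F_net, s, z, temperature=1.0):
    """Regularized policy pi_z(a|s) = softmax_a(F(s, a, z)^T z / temperature)
    for discrete actions. Assumes the hypothetical F_net(s, z) returns a
    (batch, num_actions, d) tensor of forward embeddings."""
    f = F_net(s, z)                               # (batch, num_actions, d)
    q = torch.einsum("bad,bd->ba", f, z)          # F(s, a, z)^T z for every action
    return torch.softmax(q / temperature, dim=-1)


def b_normalization_loss(B_net, s_batch):
    """Auxiliary loss pushing the empirical second moment E[B(s) B(s)^T] toward
    the identity, one way to resolve the F/B unidentifiability; the paper's
    exact regularizer is given in Algorithm 1."""
    b = B_net(s_batch)                            # (batch, d)
    cov = b.T @ b / b.shape[0]                    # empirical d x d second moment
    eye = torch.eye(cov.shape[0], device=cov.device)
    return ((cov - eye) ** 2).mean()


@torch.no_grad()
def polyak_update(target_net, online_net, tau=0.005):
    """Soft target update as in DDPG: theta_tgt <- (1 - tau) theta_tgt + tau theta."""
    for p_tgt, p in zip(target_net.parameters(), online_net.parameters()):
        p_tgt.mul_(1.0 - tau).add_(tau * p)
```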
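
The Experiment Setup row then amounts to a simple outer loop: 800 epochs, 25 cycles per epoch, and 40 gradient steps per cycle, interleaved with data collection. The schematic below assumes hypothetical callables collect_transitions (uniform or ε-greedy behavior policy), a replay buffer with extend/sample methods, and fb_update_step (FB loss plus B normalization and the Polyak target update); the batch size default is an assumption, and the real per-cycle sample counts are in Appendix D.

```python
# Schematic outer loop for the quoted setup: 800 epochs x 25 cycles x 40 SGD steps.
# All callables are hypothetical stand-ins for the components of Algorithm 1.
def train_fb(collect_transitions, replay_buffer, fb_update_step,
             num_epochs=800, cycles_per_epoch=25, grad_steps_per_cycle=40,
             batch_size=256):
    for _epoch in range(num_epochs):
        for _cycle in range(cycles_per_epoch):
            # 1) Gather transitions with the behavior policy (uniform for the
            #    mazes, epsilon-greedy w.r.t. F(s, a, z)^T z for Fetch Reach
            #    and Ms. Pacman) and add them to the replay buffer.
            replay_buffer.extend(collect_transitions())
            # 2) Fixed number of stochastic gradient steps on the FB parameters.
            for _ in range(grad_steps_per_cycle):
                fb_update_step(replay_buffer.sample(batch_size))
```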