Learning One Representation to Optimize All Rewards
Authors: Ahmed Touati, Yann Ollivier
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove viability of the method on several environments from mazes to pixel-based Ms Pacman and a virtual robotic arm. For single-state rewards (learning to reach arbitrary states), we provide quantitative comparisons with goal-oriented methods such as HER. ... We run our experiments on a selection of environments that are diverse in terms of state space dimensionality, stochasticity and dynamics. |
| Researcher Affiliation | Collaboration | Ahmed Touati, Mila, University of Montreal (ahmed.touati@umontreal.ca); Yann Ollivier, Facebook Artificial Intelligence Research, Paris (yol@fb.com). Work done during an internship at Facebook Artificial Intelligence Research Paris. |
| Pseudocode | Yes | At each step, a value of z is picked at random, together with a batch of transitions (s₀, a₀, s₁) and a batch of state-actions (s₀, a₀) from the training set, with (s₀, a₀) independent from z and (s₀, a₀, s₁). For sampling z, we use a fixed distribution (rescaled Gaussians, see Appendix D). Any number of values of z may be sampled: this does not use up training samples. We use a target network with soft updates (Polyak averaging) as in DDPG. For training we also replace the greedy policy π_z = argmax_a F(s, a, z)ᵀz with a regularized version π_z = softmax(F(s, a, z)ᵀz / τ) with fixed temperature τ (Appendix D). Since there is unidentifiability between F and B (Appendix, Remark 7), we normalize B via an auxiliary loss in Algorithm 1. (See the sketch of these components after this table.) |
| Open Source Code | Yes | Code: https://github.com/ahmed-touati/controllable_agent |
| Open Datasets | No | The paper uses environments like "Discrete Maze", "Continuous Maze", "Fetch Reach", and "Ms. Pacman". For Ms. Pacman, it mentions "A variant of the Atari 2600 game Ms. Pacman...[RUMS18]". While these are well-known environments, no specific link, DOI, or explicit access information for the *datasets* used in the experiments (e.g., collected transitions) is provided. The citations are for the environments/games, not the collected data. |
| Dataset Splits | No | No specific details about train/validation/test splits (e.g., percentages, sample counts, or explicit references to standard splits) are provided in the paper's main text. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions algorithms and techniques such as "DQN", "DDPG", "softmax", and "t-SNE", but does not provide specific software library names with version numbers or programming language versions used for implementation. |
| Experiment Setup | Yes | For all environments, we run algorithms for 800 epochs, with three different random seeds. Each epoch consists of 25 cycles where we interleave between gathering some amount of transitions, to add to the replay buffer, and performing 40 steps of stochastic gradient descent on the model parameters. To collect transitions, we generate episodes using some behavior policy. For both mazes, we use a uniform policy while for Fetch Reach and Ms. Pacman, we use an ε-greedy policy with respect to the current approximation F(s, a, z)ᵀz for a sampled z. At evaluation time, ε-greedy policies are also used, with a smaller ε. More details are given in Appendix D. (See the schematic training loop after this table.) |
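
The Pseudocode row quotes the moving parts of the training step: z sampled from a rescaled Gaussian, a softmax-regularized policy over F(s, a, z)ᵀz with temperature τ, an auxiliary loss that normalizes B, and Polyak-averaged target networks. The PyTorch sketch below illustrates those pieces only, under assumed interfaces (a hypothetical F_net(s, z) returning per-action forward embeddings for discrete actions, and B_net(s) returning backward embeddings). The z rescaling and the exact B regularizer are guesses at what Appendix D and Algorithm 1 specify, and the full FB Bellman loss lives in the paper and the linked repository, not here.

```python
# Minimal sketch (not the authors' implementation) of the components quoted in
# the Pseudocode row: z sampling, softmax policy, B normalization, Polyak update.
import torch
import torch.nn.functional as F_nn


def sample_z(batch_size, d, scale=1.0, device="cpu"):
    """One plausible reading of "rescaled Gaussians": draw z ~ N(0, I_d) and
    rescale it to norm sqrt(d) * scale (the exact scheme is in Appendix D)."""
    z = torch.randn(batch_size, d, device=device)
    return scale * (d ** 0.5) * F_nn.normalize(z, dim=-1)


def soft_policy_probs(F_net, s, z, temperature=1.0):
    """Regularized policy pi_z(a|s) = softmax_a(F(s, a, z)^T z / temperature)
    for discrete actions. Assumes the hypothetical F_net(s, z) returns a
    (batch, num_actions, d) tensor of forward embeddings."""
    f = F_net(s, z)                               # (batch, num_actions, d)
    q = torch.einsum("bad,bd->ba", f, z)          # F(s, a, z)^T z for every action
    return torch.softmax(q / temperature, dim=-1)


def b_normalization_loss(B_net, s_batch):
    """Auxiliary loss pushing the empirical second moment E[B(s) B(s)^T] toward
    the identity, one way to resolve the F/B unidentifiability; the paper's
    exact regularizer is given in Algorithm 1."""
    b = B_net(s_batch)                            # (batch, d)
    cov = b.T @ b / b.shape[0]                    # empirical d x d second moment
    eye = torch.eye(cov.shape[0], device=cov.device)
    return ((cov - eye) ** 2).mean()


@torch.no_grad()
def polyak_update(target_net, online_net, tau=0.005):
    """Soft target update as in DDPG: theta_tgt <- (1 - tau) theta_tgt + tau theta."""
    for p_tgt, p in zip(target_net.parameters(), online_net.parameters()):
        p_tgt.mul_(1.0 - tau).add_(tau * p)
```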
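
The Experiment Setup row then amounts to a simple outer loop: 800 epochs, 25 cycles per epoch, and 40 gradient steps per cycle, interleaved with data collection. The schematic below assumes hypothetical callables collect_transitions (uniform or ε-greedy behavior policy), a replay buffer with extend/sample methods, and fb_update_step (FB loss plus B normalization and the Polyak target update); the batch size default is an assumption, and the real per-cycle sample counts are in Appendix D.

```python
# Schematic outer loop for the quoted setup: 800 epochs x 25 cycles x 40 SGD steps.
# All callables are hypothetical stand-ins for the components of Algorithm 1.
def train_fb(collect_transitions, replay_buffer, fb_update_step,
             num_epochs=800, cycles_per_epoch=25, grad_steps_per_cycle=40,
             batch_size=256):
    for _epoch in range(num_epochs):
        for _cycle in range(cycles_per_epoch):
            # 1) Gather transitions with the behavior policy (uniform for the
            #    mazes, epsilon-greedy w.r.t. F(s, a, z)^T z for Fetch Reach
            #    and Ms. Pacman) and add them to the replay buffer.
            replay_buffer.extend(collect_transitions())
            # 2) Fixed number of stochastic gradient steps on the FB parameters.
            for _ in range(grad_steps_per_cycle):
                fb_update_step(replay_buffer.sample(batch_size))
```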