Thinking While Moving: Deep Reinforcement Learning with Concurrent Control

Authors: Ted Xiao, Eric Jang, Dmitry Kalashnikov, Sergey Levine, Julian Ibarz, Karol Hausman, Alexander Herzog

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our methods on simulated benchmark tasks and a large-scale robotic grasping task where the robot must think while moving."
Researcher Affiliation | Collaboration | Ted Xiao1, Eric Jang1, Dmitry Kalashnikov1, Sergey Levine1,2, Julian Ibarz1, Karol Hausman1, Alexander Herzog3 (1Google Brain, 2UC Berkeley, 3X)
Pseudocode | Yes | "Algorithm 1 shows the modified QT-Opt procedure." (See the concurrent Q-function sketch after this table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described in this paper, nor does it provide a direct link to such code. It mentions using TF-Agents and QT-Opt, which are existing libraries/methods.
Open Datasets | Yes | "We use 3D MuJoCo based implementations in DeepMind Control Suite (Tassa et al., 2018) for both tasks." (See the environment-loading sketch after this table.)
Dataset Splits | No | The paper does not provide specific details on training, validation, and test dataset splits, such as percentages or sample counts. It mentions hyperparameter sweeps but not how the data was partitioned for them.
Hardware Specification | No | The paper states that "episode generation, Bellman updates and Q-fitting are distributed across many machines", but it does not provide specific details about the hardware used (e.g., GPU models, CPU types, or memory).
Software Dependencies | Yes | "For the baseline learning algorithm implementations, we use the TF-Agents (Guadarrama et al., 2018) implementations of a Deep Q-Network agent, which utilizes a Feed-forward Neural Network (FNN), and a Deep Q-Recurrent Neural Network agent, which utilizes a Long Short-Term Memory (LSTM) network." (See the TF-Agents sketch after this table.)
Experiment Setup | Yes | "The number of action execution steps is selected from {0ms, 5ms, 25ms, or 50ms} once at environment initialization. t_AS is selected from {0ms, 5ms, 10ms, 25ms, or 50ms} either once at environment initialization or repeatedly at every episode reset." (See the latency-sampling sketch after this table.)
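
The "Pseudocode" row refers to Algorithm 1, the paper's modified QT-Opt procedure for concurrent control. As a rough illustration of the concurrent idea only (the Q-function is conditioned on the previous action and an action-selection timing feature in addition to the state and the candidate action), here is a minimal Keras sketch. The layer sizes, input names, and the normalization of t_AS are assumptions for illustration; this is not the authors' Algorithm 1.

```python
# Minimal sketch (not the paper's Algorithm 1): a Q-function whose input is
# augmented with the previous action and an action-selection timing feature,
# in the spirit of the concurrent-control formulation.
# All layer sizes and input names are illustrative assumptions.
import tensorflow as tf

def build_concurrent_q_network(state_dim: int, action_dim: int) -> tf.keras.Model:
    state = tf.keras.Input(shape=(state_dim,), name="state")
    previous_action = tf.keras.Input(shape=(action_dim,), name="previous_action")
    # Action-selection time t_AS, assumed here to be normalized to [0, 1].
    t_as = tf.keras.Input(shape=(1,), name="t_as")
    candidate_action = tf.keras.Input(shape=(action_dim,), name="candidate_action")

    x = tf.keras.layers.Concatenate()([state, previous_action, t_as, candidate_action])
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    q_value = tf.keras.layers.Dense(1, name="q_value")(x)
    return tf.keras.Model(
        inputs=[state, previous_action, t_as, candidate_action], outputs=q_value
    )

# Example instantiation with arbitrary dimensions.
q_net = build_concurrent_q_network(state_dim=10, action_dim=4)
q_net.summary()
```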
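
The "Open Datasets" row quotes the use of MuJoCo-based DeepMind Control Suite tasks. The following is a minimal sketch of loading and stepping such a task with dm_control; the specific domain and task names ("cartpole"/"swingup") and the random policy are illustrative assumptions, not the paper's exact benchmark configuration.

```python
# Minimal sketch of loading a MuJoCo-based DeepMind Control Suite task.
# Domain/task names are illustrative assumptions.
from dm_control import suite
import numpy as np

env = suite.load(domain_name="cartpole", task_name="swingup")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    # Uniform random policy, only to show the interaction loop.
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape
    )
    time_step = env.step(action)
```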
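
The "Software Dependencies" row cites the TF-Agents DQN baseline. Below is a minimal sketch of constructing a TF-Agents DqnAgent; the Gym CartPole environment, layer sizes, and learning rate are stand-in assumptions (DqnAgent requires a discrete action spec), and the paper's actual environments and hyperparameters may differ. The recurrent baseline mentioned in the quote would swap the feed-forward QNetwork for a QRnnNetwork.

```python
# Minimal sketch of a TF-Agents DQN agent. Environment, network sizes, and
# optimizer settings are illustrative assumptions, not the paper's setup.
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

# Stand-in discrete-action environment for illustration.
train_env = tf_py_environment.TFPyEnvironment(suite_gym.load("CartPole-v1"))

q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(100, 50),
)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=tf.Variable(0),
)
agent.initialize()
```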
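
Finally, the "Experiment Setup" row describes how the execution and action-selection latencies are drawn. The sketch below mirrors that description: an action-execution latency is chosen once at environment initialization, and t_AS is chosen either once at initialization or again at every episode reset. The class name, variable names, and the resample flag are illustrative assumptions.

```python
# Minimal sketch of the latency selection described in the quoted setup.
# Names and the resample_every_episode flag are illustrative assumptions.
import random

ACTION_EXECUTION_LATENCIES_MS = [0, 5, 25, 50]
ACTION_SELECTION_LATENCIES_MS = [0, 5, 10, 25, 50]

class LatencySchedule:
    def __init__(self, resample_every_episode: bool = False):
        self.resample_every_episode = resample_every_episode
        # Both latencies are drawn once at environment initialization.
        self.action_execution_ms = random.choice(ACTION_EXECUTION_LATENCIES_MS)
        self.t_as_ms = random.choice(ACTION_SELECTION_LATENCIES_MS)

    def on_episode_reset(self):
        # Optionally redraw t_AS at every episode reset.
        if self.resample_every_episode:
            self.t_as_ms = random.choice(ACTION_SELECTION_LATENCIES_MS)
        return self.action_execution_ms, self.t_as_ms

# Example: t_AS is resampled at each reset, execution latency stays fixed.
schedule = LatencySchedule(resample_every_episode=True)
print(schedule.on_episode_reset())
```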