Humanoid Locomotion as Next Token Prediction

Authors: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train our model on a dataset of sequences from a prior neural network policy, a model-based controller, motion capture, and YouTube videos of humans. We show that our model enables a real humanoid robot to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor sequences.
Researcher Affiliation | Academia | Ilija Radosavovic (UC Berkeley), Bike Zhang (UC Berkeley), Baifeng Shi (UC Berkeley), Jathushan Rajasegaran (UC Berkeley), Sarthak Kamat (UC Berkeley), Trevor Darrell (UC Berkeley), Koushil Sreenath (UC Berkeley), Jitendra Malik (UC Berkeley)
Pseudocode | No | The paper describes the model architecture and mathematical formulations (Equations 1-10) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The NeurIPS Paper Checklist explicitly states 'No' to the question: 'Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?'
Open Datasets | Yes | We construct a dataset of trajectories from four different sources: (i) a neural network policy, (ii) a model-based controller, (iii) human motion capture, and (iv) human videos from YouTube. ... human motion capture (MoCap) recordings of humans from the KIT dataset [24] distributed via the AMASS repository [21].
Dataset Splits | No | The paper mentions evaluating prediction error on 'a separate set of validation data that is held out from training data' in Section 5.3, but does not provide specific details on the size, percentage, or methodology of this split.
Hardware Specification | Yes | We train on 4 NVIDIA A100s.
Software Dependencies | No | The paper mentions using specific software like 'Isaac Gym [22]' and the 'MuJoCo simulator [36]' but does not provide version numbers for these or other software components.
Experiment Setup | Yes | Our model has a hidden size of 192 dimensions, with 4 self-attention layers and MLP layers. Each self-attention layer has 4 heads. We use LayerNorm before each attention layer and a ReLU activation after the MLP layer. We use a BatchNorm layer to process the input before the transformer model. When predicting a token at time k, to keep the context length at a reasonable size, we only keep the past 16 steps in the input. ... We compute x*(t), the ideal robot base position trajectory that fully satisfies the velocity command v(t) at all time steps. To measure the accuracy of command tracking, we define the position tracking error as $\frac{1}{T}\sum_{t=0}^{T}\lVert x(t) - x^*(t)\rVert$. Each trajectory lasts for a duration of 10 seconds.
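
To make the quoted setup concrete, below is a minimal sketch of how the described architecture and tracking metric could be implemented in PyTorch. It is not the authors' code (which is not publicly released): the input token dimension (`token_dim=42`), MLP width, residual wiring, and prediction head are illustrative assumptions, while the hidden size of 192, 4 layers, 4 heads, pre-attention LayerNorm, ReLU after the MLP, input BatchNorm, and 16-step context follow the quoted text.

```python
# Minimal sketch under stated assumptions; not the authors' implementation.
import numpy as np
import torch
import torch.nn as nn


class Block(nn.Module):
    """One transformer layer: LayerNorm before attention, MLP with ReLU."""

    def __init__(self, dim: int = 192, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)  # LayerNorm before each attention layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Exact MLP width and ReLU placement are assumptions.
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, mask):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        return x + self.mlp(self.norm2(x))


class SensorimotorTransformer(nn.Module):
    """Hidden size 192, 4 layers, 4 heads, BatchNorm on the input,
    context truncated to the past 16 steps."""

    def __init__(self, token_dim: int = 42, dim: int = 192, depth: int = 4,
                 heads: int = 4, context: int = 16):
        super().__init__()
        self.context = context
        self.input_norm = nn.BatchNorm1d(token_dim)   # BatchNorm on the raw input
        self.embed = nn.Linear(token_dim, dim)
        self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])
        self.head = nn.Linear(dim, token_dim)         # predict the next token (assumed head)

    def forward(self, x):  # x: (batch, time, token_dim)
        x = x[:, -self.context:]                      # keep only the past 16 steps
        x = self.input_norm(x.transpose(1, 2)).transpose(1, 2)
        x = self.embed(x)
        t = x.shape[1]
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        for block in self.blocks:
            x = block(x, causal)
        return self.head(x)


def position_tracking_error(x, x_star):
    """Mean L2 distance (1/T) * sum_t ||x(t) - x*(t)|| between the realized
    base positions x(t) and the ideal trajectory x*(t)."""
    x, x_star = np.asarray(x), np.asarray(x_star)
    return float(np.linalg.norm(x - x_star, axis=-1).mean())
```

In this sketch, `x_star` would be obtained by integrating the commanded velocity v(t) from the initial base position over the 10-second rollout, mirroring the tracking-error definition quoted above.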