Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Token Bottleneck: One Token to Remember Dynamics
Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. |
| Researcher Affiliation | Collaboration | 1 NAVER AI Lab, 2 Korea University |
| Pseudocode | No | The paper describes the proposed method in Section 3.3 and Figure 3 with textual descriptions and diagrams, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/naver-ai/tobo. |
| Open Datasets | Yes | We pre-train ViT-S/16 [11] on Kinetics-400 [26] for 400 epochs for the main comparison... We validate models pre-trained by our method and other baselines in five imitation learning tasks from the Franka Kitchen benchmark [15]... We consider five manipulation tasks from RLBench [21]... We evaluate the models on four simulated environments from CortexBench [33]... We conduct comparative analyses for video label propagation on video object segmentation on DAVIS [38], video part segmentation on VIP [55], and pose tracking on JHMDB [23]. |
| Dataset Splits | Yes | For each task, we collect 50 demonstration episodes for training and 10 demonstration episodes for evaluation for imitation learning. |
| Hardware Specification | No | The paper mentions training models with a large batch size and refers to 'resource constrains' but does not specify the types of GPUs, CPUs, or other hardware used for training or experimentation. |
| Software Dependencies | No | The paper mentions algorithmic components like AdamW optimizer, MLP, batch normalization, and behavior cloning loss, but does not provide specific software library names with version numbers (e.g., PyTorch, TensorFlow, Python version, CUDA). |
| Experiment Setup | Yes | We pre-train ViT-S/16 [11] on Kinetics-400 [26] for 400 epochs for the main comparison... We use AdamW optimizer [32] with a batch size of 1536, comprising dynamic scenes with a resolution of 224 x 224. These scenes are randomly sampled from videos at a rate of 30 FPS, with a temporal index gap ranging from 4 to 96. We simply apply random resized crop and horizontal flip to the scenes... To drive the learning mechanism of our proposed method, we randomly mask the target scenes with an extremely high masking ratio of 0.9. Our decoder is composed of eight vision transformer blocks... Training for each demonstration task progresses for 20,000 steps... |
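
The sampling settings quoted in the Experiment Setup row (a temporal index gap of 4 to 96, 224 x 224 inputs with 16-pixel patches, and a 0.9 masking ratio on the target scene) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the gap bounds are assumed to be inclusive, clamping the gap for short videos is an assumption, and the masked-patch count is assumed to round down.

```python
import random

PATCH = 16                 # patch size of ViT-S/16
IMAGE = 224                # input resolution quoted in the table
MASK_RATIO = 0.9           # target masking ratio quoted in the table
MIN_GAP, MAX_GAP = 4, 96   # temporal index gap range (assumed inclusive)


def sample_frame_pair(num_frames, rng):
    """Pick (source, target) frame indices separated by a random gap.

    Hypothetical helper: the gap is clamped to the video length, which
    is an assumption and is not stated in the quoted setup.
    """
    max_gap = min(MAX_GAP, num_frames - 1)
    if max_gap < MIN_GAP:
        raise ValueError("video too short for the minimum gap")
    gap = rng.randint(MIN_GAP, max_gap)          # inclusive on both ends
    src = rng.randint(0, num_frames - 1 - gap)   # keep target in range
    return src, src + gap


def sample_target_mask(rng):
    """Return one boolean per target patch, True where the patch is masked."""
    num_patches = (IMAGE // PATCH) ** 2          # 14 * 14 patches
    num_masked = int(num_patches * MASK_RATIO)   # rounds down (assumption)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


if __name__ == "__main__":
    rng = random.Random(0)
    src, tgt = sample_frame_pair(300, rng)
    mask = sample_target_mask(rng)
    print(src, tgt, sum(mask), len(mask))
```

With these settings only about a tenth of the target patches stay visible, which matches the "extremely high masking ratio" described in the quoted setup.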