Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Thinking vs. Doing: Improving Agent Reasoning by Scaling Test-Time Interaction

Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Peter Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet S Talwalkar, Aviral Kumar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI (Test-Time Interaction), a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on WebVoyager and WebArena benchmarks.
Researcher Affiliation	Collaboration	Junhong Shen1,2 Hao Bai3 Lunjun Zhang4 Yifei Zhou5 Amrith Setlur1 Shengbang Tong7 Diego Caples6 Nan Jiang3 Tong Zhang3 Ameet Talwalkar1 Aviral Kumar1 1CMU 2Scribe 3UIUC 4U of Toronto 5UC Berkeley 6The AGI Company 7NYU
Pseudocode	Yes	We provide the pseudocode in Algorithm 1. For the replay buffer, to encourage the agent to learn from more recent examples, we assign weights based on recency when sampling rollouts to update the agent: for the k-th trajectory added to the buffer, its weight is k/|D|. Algorithm 1 TTI: Filtered Behavior Cloning with Interaction Scheduling
Open Source Code	Yes	Project page: https://test-time-interaction.github.io Code: https://github.com/test-time-interaction/TTI
Open Datasets	Yes	Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on WebVoyager and WebArena benchmarks. We further show that TTI enables agents to balance exploration and exploitation adaptively.
Dataset Splits	Yes	We randomly sample 62 tasks for testing and reserve the remaining 750 for online training (see Section 5.1).
Hardware Specification	Yes	We use vLLM [88] to sample rollouts and use DeepSpeed ZeRO-3 [89] with NVIDIA H100 GPUs for training.
Software Dependencies	No	We use vLLM [88] to sample rollouts and use DeepSpeed ZeRO-3 [89] with NVIDIA H100 GPUs for training. The evaluator is a Gemma 3 27B model, prompted to detect successful trajectories, which can then be used for the online filtered BC procedure. Other hyperparameters such as the number of iterations, learning rate, and the exact schedule of TTI can be found in Appendix F.3.
Experiment Setup	Yes	We use the following hyperparameters to obtain the training curves in Table 4. During training, the vision_tower of Gemma 3 is kept frozen because it is frozen during pretraining. Other hyperparameters can be found in our code in the supplementary material. num_iteration: 10; actor_epochs: 1 (number of epochs to update the actor); rollout_size: 512; num_update_sample_per_iteration: 512; optimizer: AdamW; scheduler: WarmupCosineLR; batch_size: 4; grad_accum_steps: 2; eval_horizon: 30
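
The recency weighting quoted in the Pseudocode row (weight k/|D| for the k-th trajectory added to a buffer D) can be illustrated with a short Python sketch. This is not the authors' implementation: the function names `recency_weights` and `sample_rollouts` are hypothetical, and sampling with replacement via the standard library is an assumption made only for illustration.

```python
import random


def recency_weights(buffer_size):
    """Return the weight k/|D| for k = 1..|D| (oldest trajectory first)."""
    return [k / buffer_size for k in range(1, buffer_size + 1)]


def sample_rollouts(buffer, num_samples, rng=None):
    """Draw rollouts from the buffer, favoring recently added trajectories.

    Assumption: sampling is done with replacement; the quoted excerpt does
    not say whether the authors sample with or without replacement.
    """
    if not buffer:
        raise ValueError("replay buffer is empty")
    rng = rng or random.Random()
    weights = recency_weights(len(buffer))
    # random.choices normalizes the weights internally.
    return rng.choices(buffer, weights=weights, k=num_samples)
```

With a buffer of four trajectories, the newest one receives weight 1.0 and the oldest 0.25, so the newest is sampled roughly four times as often as the oldest.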