DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Authors: Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, Aviral Kumar

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement (from 17.7% to 67.2% success rate) over supervised fine-tuning with static human demonstration data. |
| Researcher Affiliation | Collaboration | Hao Bai (1,2), Yifei Zhou (1), Mert Cemri (1), Jiayi Pan (1), Alane Suhr (1), Sergey Levine (1), Aviral Kumar (3,4); 1: UC Berkeley, 2: UIUC, 3: CMU, 4: Google DeepMind |
| Pseudocode | No | Figure 5: Algorithm visualization. The two value functions are first trained on the original distribution of collected trajectories according to Equation (4.5) and Equation (4.6), then used to filter the trajectories for training the actor. We use the MLE (maximum likelihood estimation) loss to train the actor. |
| Open Source Code | Yes | Code available at https://github.com/DigiRL-agent/digirl. |
| Open Datasets | Yes | We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement (from 17.7% to 67.2% success rate) over supervised fine-tuning with static human demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (38.5%), but also the prior best autonomous RL approach based on filtered behavior cloning (57.8%), thereby establishing a new state of the art for digital agents for in-the-wild device control. |
| Dataset Splits | No | We use all 545 tasks in the training set for training and the first 96 tasks in the test set for testing due to computational and budget constraints. |
| Hardware Specification | Yes | Our main experiments are conducted on VM instances from Google Cloud Platform. Each VM instance comes with 1x Tesla T4 GPU and 16x Intel(R) Xeon(R) CPUs. |
| Software Dependencies | No | The visual features output from the encoder are concatenated with instruction features derived from RoBERTa [21]. |
| Experiment Setup | Yes | Hyperparameters for both Filtered BC and DigiRL are carefully tuned through binary search on the training sets of the General and Web Shopping subsets. The final choice of hyperparameters for both methods can be found in Table 6. |
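The Pseudocode row describes DigiRL's actor update at a high level: learned value functions are used to filter the collected trajectories, and the actor is then trained with an MLE loss on the steps that survive the filter. The following is a minimal Python sketch of that filter-then-imitate pattern only; the function names, the per-step `return` field, and the advantage-threshold rule are illustrative assumptions standing in for the paper's Equations (4.5) and (4.6), not the authors' implementation.

```python
import math


def filter_trajectory_steps(steps, value, threshold=0.0):
    """Keep only steps whose observed return exceeds the learned value
    baseline by `threshold` (a hypothetical stand-in for the paper's
    value-based trajectory filter)."""
    return [s for s in steps if s["return"] - value(s["state"]) > threshold]


def mle_loss(action_probs):
    """MLE objective on the kept steps: mean negative log-likelihood of
    the demonstrated actions under the actor."""
    return -sum(math.log(p) for p in action_probs) / len(action_probs)


# Toy usage: a constant-zero baseline keeps only the positive-return step.
steps = [
    {"state": "s0", "return": 1.0},
    {"state": "s1", "return": -1.0},
]
kept = filter_trajectory_steps(steps, value=lambda s: 0.0)
loss = mle_loss([0.5])  # actor probability assigned to the kept action
```

In the actual method the baseline is a trained value network rather than a constant, and the MLE loss is a token-level cross-entropy over the VLM's action outputs; the sketch only shows how filtering decouples data selection from the imitation objective.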