DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Authors: Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, Aviral Kumar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement, from 17.7% to 67.2% success rate, over supervised fine-tuning with static human demonstration data. |
| Researcher Affiliation | Collaboration | Hao Bai (UC Berkeley, UIUC), Yifei Zhou (UC Berkeley), Mert Cemri (UC Berkeley), Jiayi Pan (UC Berkeley), Alane Suhr (UC Berkeley), Sergey Levine (UC Berkeley), Aviral Kumar (CMU, Google DeepMind) |
| Pseudocode | No | Figure 5: Algorithm visualization. The two value functions are first trained on the original distribution of collected trajectories according to Equation (4.5) and Equation (4.6), then used to filter the trajectories for training the actor. We use the MLE (maximum likelihood estimation) loss to train the actor. (See the sketch after this table.) |
| Open Source Code | Yes | Code available at https://github.com/DigiRL-agent/digirl. |
| Open Datasets | Yes | We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement, from 17.7% to 67.2% success rate, over supervised fine-tuning with static human demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (38.5%), but also the prior best autonomous RL approach based on filtered behavior cloning (57.8%), thereby establishing a new state-of-the-art for digital agents for in-the-wild device control. |
| Dataset Splits | No | We use all 545 tasks in the training set for training and the first 96 tasks in the test set for testing due to computational and budget constraints. |
| Hardware Specification | Yes | Our main experiments are conducted on VM instances from Google Cloud Platform. Each VM instance comes with 1x Tesla T4 GPU and 16x Intel(R) Xeon(R) CPU. |
| Software Dependencies | No | The visual features output from the encoder are concatenated with instruction features derived from RoBERTa [21]. |
| Experiment Setup | Yes | Hyperparameters for both Filtered BC and DigiRL are carefully tuned through binary search on the training set of the General and Web Shopping subsets. The final choice of hyperparameters for both methods can be found in Table 6. |
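
The Figure 5 caption quoted in the Pseudocode row describes the core update: two value functions are fit on the original distribution of collected trajectories, their estimates are used to filter the data, and the actor is then trained with an MLE (cross-entropy) loss on the retained steps. The snippet below is a minimal, self-contained sketch of that filtering-plus-MLE pattern, not the authors' implementation: the module sizes, the advantage-style filtering rule, and all names (`ValueHead`, `Actor`, `fit_value`, `filtered_mle_update`) are illustrative assumptions standing in for the paper's VLM actor and its value functions.

```python
# Hedged sketch of advantage-filtered MLE training (assumed toy setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_ACTIONS = 64, 16  # assumed toy dimensions

class ValueHead(nn.Module):
    """Small MLP value estimator V(s); stands in for the paper's value functions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

class Actor(nn.Module):
    """Toy policy head standing in for the VLM actor; outputs action logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_ACTIONS))
    def forward(self, s):
        return self.net(s)

def fit_value(value, states, returns, steps=200, lr=1e-3):
    """Regress V(s) toward observed returns on the original trajectory distribution."""
    opt = torch.optim.Adam(value.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.mse_loss(value(states), returns).backward()
        opt.step()

def filtered_mle_update(actor, traj_value, step_value, batch, lr=1e-4):
    """Keep only steps scored positively by both value heads, then apply an
    MLE (cross-entropy) loss on the surviving steps."""
    states, actions, step_returns, traj_returns = batch
    with torch.no_grad():
        keep = (traj_returns - traj_value(states) > 0) & (step_returns - step_value(states) > 0)
    if keep.sum() == 0:
        return 0.0
    opt = torch.optim.Adam(actor.parameters(), lr=lr)
    opt.zero_grad()
    loss = F.cross_entropy(actor(states[keep]), actions[keep])
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    # Synthetic rollout data in place of real Android trajectories.
    n = 512
    states = torch.randn(n, STATE_DIM)
    actions = torch.randint(0, NUM_ACTIONS, (n,))
    step_returns, traj_returns = torch.rand(n), torch.rand(n)

    traj_value, step_value, actor = ValueHead(), ValueHead(), Actor()
    fit_value(traj_value, states, traj_returns)
    fit_value(step_value, states, step_returns)
    batch = (states, actions, step_returns, traj_returns)
    print("filtered MLE loss:", filtered_mle_update(actor, traj_value, step_value, batch))
```

The filtering threshold (keeping only steps where both value-based scores are positive) is a simple stand-in for the trajectory selection the caption describes; the paper's Equations (4.5) and (4.6) define the actual value-function objectives.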