Notes on Android in the Wild: A Large-Scale Dataset for Android Device Control

This is a summary of a research paper, written to save the reader time: reading it takes roughly 1/22 as long as reading the original (a 22:1 time savings). It was crafted by humans working with several AIs. The goal is to save time and curate good ideas.

Link to paper: https://arxiv.org/abs/2307.10088

Paper published on: 2023-07-19

Paper's authors: Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap

GPT3 API Cost: $0.04

GPT4 API Cost: $0.10

Total Cost To Write This: $0.14

Time Savings: 22:1

The ELI5 TLDR:

The Android in the Wild (AITW) dataset is a large collection of human demonstrations of device interactions on Android apps and websites. It includes different types of tasks and actions, with clear and descriptive labels. The dataset also includes screenshots and information about the UI elements. It is a versatile resource for training models that can understand and interact with various applications. The dataset was collected using a two-stage pipeline and is freely available for download. It can be used to develop and test device-control systems and is expected to advance research in areas like screen understanding and image captioning.

The Deeper Dive:

A Deep Exploration of the Android in the Wild (AITW) Dataset

The Android in the Wild (AITW) dataset is a large, diverse dataset for device-control research. It is designed to aid the development of systems that interpret natural language instructions from humans and perform the corresponding actions directly on a device's User Interface (UI). It comprises 715k episodes spanning 30k unique prompts (instructions), drawn from interactions across hundreds of Android apps and websites, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6).

Understanding the Dataset

AITW consists of human demonstrations of device interactions: the screens, the actions taken, and the natural language instructions that go with them. It includes both multi-step tasks, which require an understanding of language and visual context, and single-step tasks, which were manually annotated using a technique called hindsight relabeling.

Hindsight relabeling means writing the task description after the fact: annotators manually review recorded trajectories and give them clear, descriptive task descriptions that match what was actually done. This improves the quality of the dataset because the labels describe the demonstrated actions precisely.

The Structure of AITW

AITW's actions are described by four fields: type, touch_point, lift_point, and typed_text. The 'type' field represents the kind of action performed, such as a tap, swipe, or text input. The 'touch_point' and 'lift_point' fields give the screen coordinates where a gesture starts and ends. The 'typed_text' field is used only when the action involves text input.
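
As a rough illustration of the four-field action described above, here is a minimal Python sketch. Only the field names (type, touch_point, lift_point, typed_text) come from the summary. The string values "tap", "swipe", and "text", the use of normalized (x, y) coordinates, and the helper functions are assumptions made for this example, not the dataset's actual encoding.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: points are (x, y) pairs normalized to the range [0, 1].
Point = Tuple[float, float]


@dataclass(frozen=True)
class Action:
    """One step in an episode, using the four fields named in the summary."""
    type: str                           # hypothetical values: "tap", "swipe", "text"
    touch_point: Optional[Point] = None
    lift_point: Optional[Point] = None
    typed_text: Optional[str] = None


def is_well_formed(action: Action) -> bool:
    """Check that an action carries the fields its type needs."""
    if action.type in ("tap", "swipe"):
        return action.touch_point is not None and action.lift_point is not None
    if action.type == "text":
        return action.typed_text is not None
    return True  # other action types need no extra fields in this sketch


def gesture_length(action: Action) -> float:
    """Euclidean distance between where the finger lands and where it lifts."""
    if action.touch_point is None or action.lift_point is None:
        raise ValueError("gesture_length needs both touch_point and lift_point")
    return math.dist(action.touch_point, action.lift_point)


tap = Action("tap", touch_point=(0.5, 0.5), lift_point=(0.5, 0.5))
swipe = Action("swipe", touch_point=(0.5, 0.8), lift_point=(0.5, 0.2))
typing = Action("text", typed_text="hello")
```

In this sketch a tap is simply a gesture whose touch and lift points coincide, while a swipe covers some distance on screen.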

The dataset also includes RGB screenshots, which are post-processed to map them to detected UI elements. This process involves identifying the various UI components present in the screenshot and assigning them appropriate labels. This information can be instrumental in tasks like screen understanding and image captioning.
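
One way to picture how detected UI elements might be used alongside actions is to look up which element a touch point falls on. The sketch below assumes each detected element has a text label and an axis-aligned bounding box in normalized coordinates. The UIElement class, its fields, and the element_at helper are hypothetical and do not reflect the dataset's real schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UIElement:
    """Hypothetical detected element: a label plus a normalized bounding box."""
    label: str
    left: float
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)

    def area(self) -> float:
        return self.width * self.height


def element_at(elements: List[UIElement], x: float, y: float) -> Optional[UIElement]:
    """Return the smallest element containing (x, y), or None if nothing is hit."""
    hits = [e for e in elements if e.contains(x, y)]
    return min(hits, key=UIElement.area) if hits else None


screen = [
    UIElement("toolbar", left=0.0, top=0.0, width=1.0, height=0.1),
    UIElement("search icon", left=0.85, top=0.02, width=0.1, height=0.06),
]
hit = element_at(screen, 0.9, 0.05)
```

Choosing the smallest containing box is a design choice for this example: nested elements (an icon inside a toolbar) then resolve to the most specific one.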

The Diversity of AITW

The AITW dataset is not limited to a specific type of task or application. It contains high-level tasks related to Google apps, app installation, web shopping, and general tasks. This diversity makes the dataset a rich resource for training models that can understand and interact with a wide range of applications.

Collection and Curation of AITW

The dataset was collected using a two-stage pipeline. The first stage involved raters performing tasks on Android emulators. These tasks ranged from simple actions like opening an app to complex multi-step tasks like booking a hotel. In the second stage, hindsight language relabeling was applied to the collected data to ensure the accuracy and clarity of the task descriptions.
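
The second stage can be sketched in code as follows. This is only an illustration of the idea of hindsight relabeling, assuming an episode is an instruction plus an ordered list of steps. The Episode class, the relabel function, and the example strings are hypothetical; in the real pipeline the new description is written by a human annotator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Episode:
    """Hypothetical episode: a natural language instruction plus ordered steps."""
    instruction: str
    steps: List[str]


def relabel(episode: Episode, start: int, end: int, new_instruction: str) -> Episode:
    """Cut steps[start:end] out of a recorded episode and attach a description
    written after the fact for that segment."""
    if not 0 <= start < end <= len(episode.steps):
        raise ValueError("segment must be a non-empty slice of the episode")
    return Episode(new_instruction, episode.steps[start:end])


# Stage 1 (assumed example): a rater demonstrates a multi-step task.
recorded = Episode(
    instruction="Install a weather app",
    steps=["open app store", "tap search bar", "type 'weather'", "tap install"],
)

# Stage 2: an annotator reviews the trajectory and labels one step in hindsight.
single_step = relabel(recorded, 1, 2, "Tap the search bar")
```

The original episode is left unchanged, so a single recorded trajectory can yield both a multi-step example and one or more single-step examples.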

Applications of AITW

AITW is designed to spur research in creating more powerful device automation models. It provides experimental setups for evaluation under varying conditions, including novel tasks and language, Android versions, and applications and websites. This makes AITW a versatile tool for developing and testing device-control systems.

Two models were evaluated on the AITW dataset: a behavioral cloning (BC) agent and a large language model (LLM) agent. The BC model, which used a Transformer-based architecture and was conditioned on BERT embeddings of the natural language instructions, performed better across all splits, including out-of-domain tasks. The LLM model performed worse, likely because of its element-based action space.

AITW and Future Research

The AITW dataset is expected to play a crucial role in advancing research in areas like screen understanding, screen generation, question answering, image captioning, and activity recognition. Its diverse and comprehensive nature makes it an excellent resource for training models capable of understanding and interacting with a wide range of applications.

Moreover, the AITW dataset is freely available for download on GitHub, making it accessible to researchers worldwide. This wide availability, combined with the dataset's size and diversity, is expected to catalyze significant advancements in the field of device-control research.

In conclusion, the AITW dataset is a powerful tool for developing and testing device-control systems. Its diverse and comprehensive nature, combined with its free availability, makes it an invaluable resource for researchers in this field.