Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but
many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image
training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike
prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a
dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a
conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate
response quality, and showcase its applicability to multi-step UI navigation and planning.
- ** Work done while at Apple
- † Aalto University
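To make the data-generation recipe above concrete, the following is a minimal sketch of pairing a pixel-based UI element detector with an LLM to produce conversational training examples. The function names `detect_ui_elements` and `llm_complete` are hypothetical placeholders, not interfaces from the paper; the prompt wording is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    label: str                        # e.g. "button", "text field"
    text: str                         # visible text, if any
    bbox: tuple[int, int, int, int]   # (x, y, width, height)


def detect_ui_elements(screenshot_path: str) -> list[UIElement]:
    """Placeholder for any pixel-based UI element detector."""
    raise NotImplementedError


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to a Large Language Model."""
    raise NotImplementedError


def generate_conversation(screenshot_path: str) -> dict:
    """Pair a screenshot with LLM-generated conversational text.

    Detected elements are serialized into a textual scene description,
    and the LLM is asked to produce Q&A, a UI description, and a plan
    grounded in that description; no human annotation is involved.
    """
    elements = detect_ui_elements(screenshot_path)
    scene = "\n".join(
        f"- {e.label} '{e.text}' at {e.bbox}" for e in elements
    )
    prompt = (
        "You are shown a UI described by its detected elements:\n"
        f"{scene}\n"
        "Generate a short conversation about this UI covering a question "
        "and answer, a description of the screen, and a multi-step plan "
        "for a plausible user goal."
    )
    return {"image": screenshot_path, "conversation": llm_complete(prompt)}
```

Applied over an arbitrary corpus of UI screenshots, a loop over `generate_conversation` would yield image-conversation pairs of the kind used here for fine-tuning, without any human-provided annotations.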