Apple researchers develop AI that can ‘see’ and understand screen context


Apple researchers have developed a new artificial intelligence system that can understand ambiguous references to on-screen entities as well as conversational and background context, enabling more natural interactions with voice assistants, according to a paper published on Friday.

The system, called ReALM (Reference Resolution As Language Modeling), leverages large language models to convert the complex task of reference resolution — including understanding references to visual elements on a screen — into a pure language modeling problem. This allows ReALM to achieve substantial performance gains compared to existing methods.
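The idea of casting reference resolution as a pure text task can be illustrated with a small sketch: candidate entities are enumerated in a prompt and the model is asked which ones a request refers to. The prompt template and entity strings below are illustrative assumptions, not ReALM's actual format.

```python
# Hypothetical sketch: frame reference resolution as a text-only task
# by enumerating candidate entities in a prompt. The template here is
# an illustration, not the paper's actual prompt format.

def build_prompt(entities, query):
    """Build a prompt listing candidate entities so a language model
    can answer a reference question purely from text."""
    listing = "\n".join(f"{i}. {e}" for i, e in enumerate(entities))
    return (
        "Entities visible to the assistant:\n"
        f"{listing}\n"
        f"User request: {query}\n"
        "Which entity numbers does the request refer to?"
    )

prompt = build_prompt(
    ["phone number: 555-0100", "address: 1 Infinite Loop"],
    "call that number",
)
print(prompt)
```

A fine-tuned model receiving such a prompt only ever sees text, which is what lets a comparatively small language model handle the task.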

“Being able to understand context, including references, is essential for a conversational assistant,” wrote the team of Apple researchers. “Enabling the user to issue queries about what they see on their screen is a crucial step in ensuring a true hands-free experience in voice assistants.”

Enhancing conversational assistants

A key innovation of ReALM is how it tackles screen-based references: it reconstructs the screen from parsed on-screen entities and their locations, generating a textual representation that captures the visual layout. The researchers demonstrated that this approach, combined with fine-tuning language models specifically for reference resolution, could outperform GPT-4 on the task.
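One plausible way to turn parsed entities into such a textual layout is to group them into lines by vertical position and order each line left to right. The entity fields and grouping heuristic below are assumptions for illustration, not Apple's implementation.

```python
# Hypothetical sketch: serialize parsed on-screen entities (text plus
# bounding-box positions) into a plain-text layout a language model can
# read. Field names and the line-grouping heuristic are assumptions.

def screen_to_text(entities, line_tolerance=10):
    """Group entities into lines by top edge (within a tolerance),
    then order each line left-to-right, approximating the layout."""
    ordered = sorted(entities, key=lambda e: (e["top"], e["left"]))
    lines, current, current_top = [], [], None
    for ent in ordered:
        if current_top is None or abs(ent["top"] - current_top) <= line_tolerance:
            current.append(ent)
            current_top = ent["top"] if current_top is None else current_top
        else:
            lines.append(current)
            current, current_top = [ent], ent["top"]
    if current:
        lines.append(current)
    return "\n".join(
        " ".join(e["text"] for e in sorted(line, key=lambda e: e["left"]))
        for line in lines
    )

entities = [
    {"text": "Call", "top": 100, "left": 10},
    {"text": "555-0100", "top": 102, "left": 60},
    {"text": "Email", "top": 140, "left": 10},
]
print(screen_to_text(entities))  # → "Call 555-0100" then "Email"
```

The point of such a serialization is that a text-only model can then "see" where elements sit relative to one another without any image input.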

“We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references,” the researchers wrote. “Our larger models substantially outperform GPT-4.”

Practical applications and limitations

The work highlights the potential for focused language models to handle tasks like reference resolution in production systems, where using massive end-to-end models is infeasible due to latency or compute constraints. By publishing the research, Apple is signaling its continuing investment in making Siri and other products more conversational and context-aware.

Still, the researchers caution that relying on automated parsing of screens has limitations. Handling more complex visual references, like distinguishing between multiple images, would likely require incorporating computer vision and multi-modal techniques.
