AI is finally unlocking a new interface paradigm

We’ve been using the graphical interface for decades, while voice assistants have struggled. Could AI be the key that unlocks a new form of computing?

For the past 40 years, we’ve known exactly how to make computers work: you point at something to make something happen. Be it a mouse, trackpad, or just your finger, pointing is how we’ve controlled our tech for four decades. The touch interface was merely an evolution of the windows, icons, menus, and pointer interface that came before it.

But could that be about to change?

People have been playing with voice interfaces for years. Many of us have smart speakers in our homes, but their uses have been more limited than many people expected. The need to know specific incantations to make the magic tube do what you want means that, for most people, it’s come down to weather, timers, and music.

Interface backsliding

And, in some ways, we’re actually going backwards. Today’s hot new thing, generative AI, has spawned the skill of “prompt crafting” — writing requests for the AI to generate the thing you want. A quick web search turns up dozens of guides and courses for this newly essential skill. Yet, for those of us old enough, and with functioning memories, this seems strangely familiar. We remember booting a computer and staring at a blank screen and a text prompt. Ah, the days of DOS when you had to know the incantations to get the computer to do anything.

This, obviously, won’t last. Adobe’s Firefly AI has a visual interface for some of its elements, and we should expect others to follow its lead. Having to remember to type “--ar 16:9” at the end of a prompt to get a landscape image in Midjourney, for example, is more like something from the 70s than the 2020s.

The restricted use of smart speakers and the limitations of prompt-crafting are basically the same problem. They’re about the discoverability of the technology’s features and the device’s ability to understand us. They’re about context. And perhaps the way they’ll eventually evolve into the next true interface is by bringing them together.

Apple’s new AI-infused interface

If the rumours are to be believed, Apple is working hard to bring AI features to its devices in the next operating system cycle, likely to be announced at its developer conference in June. And a recent research paper suggests that its ReALM model outperforms GPT-4 in situations where context is needed:

ReALM takes into account both what’s on your screen and what tasks are active.

The paper refers to this as “reference resolution” — the ability of the LLM to understand references to things that aren’t already part of the conversation the user is having with it. It edges AI towards a more human-like understanding of sentences based on other contextual clues.
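The paper’s exact encoding isn’t something we can reproduce here, but the general idea is easy to sketch: serialise whatever is on screen into plain text, hand it to a language model alongside the user’s utterance, and ask which entity the user means. Here is a minimal, hypothetical illustration; the entity format and the stubbed-out model call are assumptions made for the example, not ReALM’s actual approach:

```python
# Hypothetical sketch of screen-aware reference resolution.
# The entity format and the absence of a real model call are
# illustrative assumptions, not how ReALM is implemented.

SCREEN_ENTITIES = [
    {"id": 1, "type": "phone_number", "text": "+44 20 7946 0018"},
    {"id": 2, "type": "business",     "text": "Café Kranzler, Berlin"},
    {"id": 3, "type": "phone_number", "text": "+49 30 1234 5678"},
]

def build_prompt(entities: list[dict], utterance: str) -> str:
    """Serialise on-screen entities into plain text alongside the user's request."""
    lines = [f"{e['id']}. ({e['type']}) {e['text']}" for e in entities]
    return (
        "Entities currently on screen:\n"
        + "\n".join(lines)
        + f"\n\nUser says: \"{utterance}\"\n"
        + "Which entity id does the user mean? Answer with the id only."
    )

def resolve_reference(entities: list[dict], utterance: str) -> str:
    prompt = build_prompt(entities, utterance)
    # A real assistant would send this prompt to a language model;
    # returning it keeps the sketch self-contained and runnable.
    return prompt

print(resolve_reference(SCREEN_ENTITIES, "call the second number"))
```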

If this works, there’s the potential for voice assistants, like Apple’s deservedly maligned Siri, to actually evolve towards more conversational interfaces. Right now, they rely on you stumbling upon, or learning, the exact incantation.

Voice redux

Somewhere, in the intersection of voice recognition and modern large language models, lies the potential to build a new model of interface, one rooted in the way we communicate much of the time: our voices. The fact that smart speakers remain popular despite their limitations strongly suggests that the demand is there for this kind of interface. So, while current AIs might evolve towards a more graphical interface, they might also evolve towards a more natural language one. Being able to say “take image three and make it landscape” would remove the need to both remember and type “--ar 16:9”.
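It wouldn’t take much machinery to make that work today: even a thin translation layer could map conversational phrasing onto the flags current tools expect. Here’s a toy sketch; the phrase table and function name are invented for illustration and have nothing to do with how Midjourney actually works:

```python
# Toy sketch: translate a conversational request into a Midjourney-style
# "--ar" aspect-ratio flag. The phrase table below is invented for illustration.

ASPECT_RATIOS = {
    "landscape": "16:9",
    "widescreen": "16:9",
    "portrait": "9:16",
    "square": "1:1",
}

def rewrite_request(utterance: str, base_prompt: str) -> str:
    """Append the matching --ar flag when the request mentions a known orientation."""
    lowered = utterance.lower()
    for phrase, ratio in ASPECT_RATIOS.items():
        if phrase in lowered:
            return f"{base_prompt} --ar {ratio}"
    return base_prompt  # no orientation mentioned; leave the prompt untouched

print(rewrite_request("take image three and make it landscape",
                      "a lighthouse at dusk"))
# -> "a lighthouse at dusk --ar 16:9"
```

A real version would lean on the language model itself rather than a lookup table, but the point stands: the interface can absorb the incantation so the user doesn’t have to.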

And the big AI companies certainly need to take a hard look at how their models operate, and improve on that. Because the days of just throwing more and more training data at them and relying on that to make them better may well be coming to an end:

Some executives and researchers say the industry’s need for high-quality text data could outstrip supply within two years, potentially slowing AI’s development.
AI companies are hunting for untapped information sources, and rethinking how they train these systems. OpenAI, the maker of ChatGPT, has discussed training its next model, GPT-5, on transcriptions of public YouTube videos, people familiar with the matter said.

Better interfaces that allow people to do more with what these models already offer might well be a more profitable course in the short term than just trying to find more data to train on. And smart voice interfaces would help transform some of the devices we already use. Smart speakers are a decade old this year — truly bringing AI to them could well be transformative, as big a jump as the move from dumb phones to smartphones was 17 years ago.

The next interface: spatial awareness

But just as your Nokia flip phone and your Amazon Echo were early glimpses of something that will — eventually — transform into something more compelling, we can see the very first signs of what could be next right now. Well, if you have a spare $3,500 lying around, that is. The Apple Vision Pro is a first stab at making a computer — rather than a games device — that is aware of the space around you. The Apple Car project might well be dead now, but some of the technology lives on in the myriad sensors in the headset. Indeed, Apple has been slowly working on contextually aware interfaces for the last few years.

You can experience a hint of this for significantly less money via the rather more affordable AirPods Pro. Their Conversation Awareness mode, which knocks back the noise-cancelling when they detect you’re speaking with someone, is one example. It echoes the Vision Pro’s ability to spot when someone is approaching you and bring them into your digital field of vision.

This is a whole other layer of context. It’s a vital one if we’re ever to truly see self-driving cars — but that still feels as far off as it was when we last ran sessions on it, back when NEXT was in Berlin. Carrying around sensor-filled devices that can contextually understand what is physically around us, and that change how we interact with them based on that understanding, is the moment computing truly escapes the screen and becomes ambient.

The next interface might not even feel like an interface at all.

Picture by AdobeStock.