Dígame, Dígame

Making language interfaces work

Andrew Bowers
Scribblings on Slate

--

I think of CES as something of an Island of Misfit Toys. A lot of good intentions that often don’t quite hit the mark. But I applaud the effort. It’s really hard to build something and really easy to critique it.

Watching this year’s CES got me thinking about natural language.

Speak to me

Since ChatGPT launched last year, there's been renewed interest in natural language as a computer interface, and even in associated hardware to go with it. No doubt generative AI can create new interaction models and support new devices. The ability to handle a broad range of input, to reason over that input without hard-coding every scenario, and to generate human-usable output (everything from text to visual environments, again without hard coding) will open doors.

At the same time, I see a lot of carpenters out there. By that, I mean they have a hammer and everything looks like a nail. In this case, the hammer is LLMs, so every problem is being treated as a natural-language nail.

To abuse Marshall McLuhan's 'The Medium is the Message', the interface medium is central to what can be expressed and experienced. Natural language can be magical, but you've got to solve a few key things:

Discoverability & Constraints

Language interfaces struggle to concisely present options on their own. As a user, you just don't know what your options are or what you can do. LLMs have improved the handling of input and output, but this fundamental constraint still exists.

High cost of language input

Verbalizing cogent, concise thoughts and interacting by voice requires a lot more energy than a finger swipe. Take a very silly, simple example: imagine you had to say 'next' to scroll through a feed on your phone. Would that be efficient or enjoyable? Now amplify that to more complex tasks.

Parsability

Humans have an incredible ability to visually parse their environment. Do you ever marvel at the fact that you can scan a room for a familiar face, or a cluttered closet for a particular item? Text and voice aren't parsable in the same manner. One of the most annoying things about many current voice interactions is their summarization of the question or intent. It makes sense that it's there to ensure the context is correct, but it adds friction in a way a visual affordance would not.

Multimodal flexibility

You can't watch a movie or view photos in a voice or text interface, but you can read text on a screen that can also play a movie. A device that can do text AND visuals will win out unless the case for specialization is strong.

Alignment of incentives

For many years, voice assistants and other agents from Google, Amazon, and others have tried to simplify interaction with services online. But no service wants a simple, atomic, transactional interaction. As a service, you want to engage, you want to upsell, you want to cross-sell.

Take a non-voice example: Google had a browser agent service that simplified buying movie tickets through sites like Fandango by automating the process for you. It was SO much better than the terrible 15-step Fandango experience. But the alignment with Fandango's business incentives was questionable, because Fandango wanted to upsell me on various crap. The same issue existed for voice assistants, and I see it with many of the AI agents being proposed now. You can perhaps be successful without alignment, but it's going to be really hard without the market power of someone like Apple.

Social acceptance

I agree that people adapt and behaviors become normalized, but that doesn't help much if you are trying to sell a product today. You need to create a successful arc from here to there. I love GoPro as an example of a wearable that executed well here. If you saw someone walking down the street with a camera strapped to the top of their head, you'd find it weird. But if you see someone snowboarding down a mountain with that same camera, it becomes more acceptable. You can ride this curve from specialization to broad acceptance, but you have to start by solving an acute problem, like GoPro did.

Spoken language can also be intrusive. If you think your device is designed to be mobile, stop and consider, as an extreme example, how it would be used on a Japanese subway, where speaking aloud is frowned upon.

Interference

Lastly, spoken language can also be interfered with. If you have kids and a Google Home, an Alexa, or the like, you may know how easy it is to have your music hijacked, or how hard it is to give a voice command above the din of kids. A car is a great place for voice, but I think to be broadly useful it has to solve for voice hijacking and (to use a generous euphemism) 'conversational' environments.

Where we end up — Augmentation or Specialization?

With these characteristics in mind, I think the most successful products will use language to a) augment visual experiences or b) create specialized hands-free ones.

In both cases, trying to replace a mobile phone with another device is the wrong goal, IMHO. Augment? Yes. Something you don't use a phone for today? Yes.

Language interfaces will play a huge co-pilot role, drastically improving our interaction with products, not as a replacement but as an enhancement. Sometimes you'll want to ask for something; sometimes you'll want to hit a button.

And hands-free specialization can be a huge win, but that specialization has to beat convenience by 10x if the product has to go it alone.

In all cases, language is going to play a much larger part in our technology interfaces. I can’t wait to see where it goes.
