Adelphoi Music

Our view

Jamie explores the shift away from typing to speaking

Siri, Alexa, Cortana, Google Assistant... don't they all sound a bit alike? Even when you take all the variants into account (Siri, for instance, can speak English in Irish, British, U.S., Canadian, South African, Indian, Australian, Singaporean, and New Zealand accents, either male or female), it still sounds to me as if the designers' first priority was to come up with the blandest, most 'transparent', least characterful voice possible. And I don't hear anything different in the other 'intelligent personal assistant' software offered by the other big names. Put the voices side by side, and the biggest difference I hear is just what kind of speaker they're coming out of.

There are plenty of reasons for this, I suppose. The technology, for one. The text-to-speech (TTS) systems behind these virtual assistants generate their voices by splicing together tiny slices taken from recordings of real people, and it's a complicated enough job dealing with the natural ups and downs, pauses, and speeding-ups and slowing-downs we use unconsciously even when we're being deadpan, without having to add in all the other stuff that shows character or emotion. And then again, for us poor humans it's still a novel experience to be able to talk to our devices and have them talk back to us. I don't know if we're actually ready for a computer that talks like a real person.

Both these factors are going to change. The technology will get better at humanising synthetic speech, it will become much easier and cheaper to do, and we will get used to hearing it. In fact, we're going to be deluged with it, as the keyboard and keypad give way to voice-activated services all over the internet. All those interactive voice response (IVR) systems that you currently dial into on your phone – buying, booking, banking – will be reincarnated as speaking voices on websites. I can't imagine that the net will ever lose its visual dimension, but the shift away from typing to speaking will put a huge new emphasis on audio: music and voice, but above all voice.

When that happens, the voices will have to distinguish themselves. Brands will have individual voices, and the voice will be part of the brand. When people interact with the digital world, they may have to rely purely on audio cues just to navigate. The voices will be the guides, and it's from them that people will get their first clues about what the brand stands for.

That's the point at which vocal character will become important. I wonder how that will all work out. I suppose a time must come when we stop worrying about whether we're talking to a real person or a computer, so we won't need speech synthesis to do what it does now, which is to sound just human enough without actually fooling anyone. Will computer voices eventually become more individual, more characterful, more gnarly, more emotional? That's as hard to predict as the weather. But technology will push us that way, as well as our own inner need to try new things.

But characterful voices bring new factors into the equation. See, one useful thing about the bland Siri-voice is that it gives just enough positive cues (friendly, attentive, native speaker, male/female, adult...) to invite users to construct a personality in their own heads, connecting the dots and filling in the blanks in whatever way works for them (and there's plenty of evidence that users do precisely that). It's tremendously flexible: the Siri-voice becomes whatever it needs to be for all the people who use it. But as voices become more individual, there's more for the user to respond to: things that irritate as well as charm.

Which might just bring us back to where we are with 'real' voices. In advertising, actors, celebrities and voiceover artists tend to be used on a campaign-by-campaign basis, and it's rare to find anything more consistent or long-lasting. Garrison Keillor for Honda (in the UK), perhaps? Generally, associations between brands and voices have been temporary and precarious, because fashions change, things get stale, celebrity status goes up and down, meanings shift. Is it possible that that's the way we'll go: computer-voices auditioned and weighed up with as much attention as voiceover actors are on TV commercials? Or will businesses want to 'build' the perfect voice for their brand and stick with it like a logo? Somewhere in between, perhaps.