Researchers use generative adversarial networks (GANs) and other machine learning techniques to manipulate audio and video, which can result in deepfakes. In principle, with sufficient training data, AI voice synthesis can generate voice skins for anybody. Still, it’s crucial not to assume that deception is the main point of voice-modeling technologies. It isn’t, and we’ve discussed this at length in our article about ethical voice cloning.
Voice cloning technology is not meant to imitate people and build fake identities. Rather, its purpose is to offer you novel opportunities, such as enjoying a video game that involves human-to-human audio communication without giving up your right to remain anonymous, or hearing the sweet words of a loved one who is no longer here.
The top priority of voice conversion software is digital cooperation in the service of human interests, without dismissing any of those interests as “irrelevant” or ranking some as “more relevant than others”. In other words, AI voice synthesis must comply with ethical imperatives and be used in a socially responsible manner.
How exactly can we benefit from AI voice synthesis, and what parts of our lives can it help us improve?
1. Voice assistants
This is the generic answer to the above question. According to Ovum's Digital Assistant and Voice AI–Capable Device Forecast: 2016–21, voice assistants will outnumber the human beings that live on Earth by 2021.
The evolution of voice assistants is forecast to take place in three stages that differ in their degree of personalization. In the first, “simple personalization” stage, digital assistants such as Samsung's Bixby, Apple's Siri or Amazon's Alexa will personalize the way they serve users based on relevant factors such as age, gender, language, accent, previously expressed preferences and past behaviour.
Services will become more personalized as assistants begin to leverage information that they infer from personal data. For example: “If X is in the 18-35 age group, and X is located in the capital of a Scandinavian country, then X might be interested to hear about bike lanes for going from location L1 to location L2”.
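The inference rule quoted above can be sketched in a few lines of code. This is only an illustration of the idea, not how any real assistant works: the `UserProfile` type, the `suggest_topics` function and the list of capitals are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Assumption for this sketch: the capitals the example rule applies to.
SCANDINAVIAN_CAPITALS = {"Copenhagen", "Oslo", "Stockholm"}


@dataclass
class UserProfile:
    """Hypothetical, minimal profile an assistant might hold about a user."""
    age: int
    city: str


def suggest_topics(user: UserProfile) -> list[str]:
    """Return topics inferred from the profile using simple if-then rules."""
    topics = []
    # The rule from the text: 18-35 years old and living in a Scandinavian capital.
    if 18 <= user.age <= 35 and user.city in SCANDINAVIAN_CAPITALS:
        topics.append("bike lanes")
    return topics
```

Calling `suggest_topics(UserProfile(age=27, city="Oslo"))` would return the bike-lane suggestion, while a user outside the age group or living elsewhere would get an empty list. A real assistant would presumably combine many such signals rather than rely on a single hand-written rule.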
The second stage is that of “advanced personalization”, when specific information will be provided based on users’ emotional state, health, and well-being, all expressed in quantitative form. Voice recognition will be an extremely valuable tool for identifying people’s emotional state, which can be leveraged, for example, by in-car voice assistants that recommend taking a break, having a coffee, or playing more uplifting music.
In the “quest for meaning” stage of personalization, people’s relationship with voice assistants is expected to go beyond functionality and transaction, closer to counselling or companionship. According to Mashable, Apple’s increased hiring of people with psychology backgrounds suggests that the company aims to make Siri a better therapist.
Let’s take a closer look at some specific forms that voice assistance can take.
2. Voice assistance for users with disabilities
People can lose the power of speech to different degrees, due to many conditions: ‘mere’ stuttering; apraxia, in which syllables are scrambled; motor neurone disease and cerebral palsy, which leave people without the muscle control needed to articulate; traumatic brain injury; stroke; surgical removal of tissue required by malignant tumours; multiple sclerosis; or autism.
This wide range of conditions leaves many people in need of digital augmentative and alternative communication (AAC) methods. The creation of an alternative, digital voice is underpinned by splitting two elements of the human voice that typically function as one: the source (the vocal cords, larynx and throat muscles) and the filter (the muscles - tongue, lips, pharynx, etc. - that articulate words).
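To make the source/filter split more concrete, here is a minimal sketch of a classic source-filter synthesizer. It is a toy illustration and not the method used by any particular AAC product: the function names are hypothetical, the "source" is a bare impulse train, and the "filter" is a chain of simple two-pole resonators with rough, illustrative formant values.

```python
import math


def glottal_source(f0_hz, duration_s, sample_rate=16000):
    """Source: an impulse train, one pulse per pitch period (crude vocal-fold model)."""
    n = int(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate=16000):
    """Filter: a two-pole resonator standing in for one vocal-tract formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Pass the source through each formant resonator, then normalize to [-1, 1]."""
    signal = glottal_source(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]


# Assumed, approximate formants for an "ah"-like vowel: (frequency, bandwidth) in Hz.
AH_FORMANTS = [(700, 100), (1100, 120)]
samples = synthesize_vowel(f0_hz=120, formants=AH_FORMANTS)
```

Because the two parts are separate, the pitch (`f0_hz`) can be changed without touching the formants, and vice versa. That independence is what lets a digital voice keep a recognizable character while saying new things.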
Perhaps the most famous case of AAC technology compensating for loss of speech is that of the astrophysicist Stephen Hawking, who had amyotrophic lateral sclerosis. Voice synthesis allowed him to use text-to-speech software that spoke aloud what he wrote on a tablet, and thus to communicate rather fluently.
Another use of AI voice synthesis in healthcare is to allow visually impaired people to have emails and text messages read aloud, to place orders online, or to manage tasks remotely: turning on the heating, drawing curtains, getting reminders for medical appointments, etc. The takeaway from this use case is that AI voices are getting better and better at handling routine tasks, so the scope of voice assistance can be expected to expand at a fast pace.
3. Voice search
Text-based search engines return a number of results for the queries that you type in, and then it’s up to you to choose the links that seem to provide the information that you’re looking for. This is a ‘trial and error’ kind of process - you keep clicking on various links until your query is satisfied.
In contrast, voice searches offer more ‘to the point’ answers, which spare you the need to check the information that’s available ‘a click away’ for yourself. ‘A click away’ is certainly not far, but if you can do better than that, it’s worth trying, isn’t it? Forecasts suggest that many people will agree: by 2022, half of all internet searches may take the form of spoken questions and answers.
4. Voice online shopping
AI voice synthesis offers a new approach to online shopping, where voice assistants can help you just like store employees do. That is, they can offer (more and more) personalized advice for well-informed decisions about what to buy. Ordering stuff by chatting to virtual shop assistants is the most direct route to fulfilling your material needs.
Amazon reported three times more voice shopping in 2018 than in the previous year, and estimates that shopping through voice assistants will reach $40 billion by 2022. Both figures suggest that people are ready for this paradigm shift from text to speech in online shopping.
5. Resurrecting voices from the past
We’re referring to speech-to-speech voice cloning technology, which enables companies to clone any voice to create speech that's indistinguishable from the original speaker, making it perfect for a wide range of media content creators.
Just imagine: you can now bring back the voice of someone who has sadly passed away, be that an actor, an author, or any other public figure. After getting proper consent, of course.
Its scope of applicability is quite wide, from film and TV, to gaming, to podcasts and audiobooks. Digital voice resurrection is also a viable alternative to bringing a high-demand actor back to the recording studio over and over again for voiceover or dubbing work, and it lets an actor sound just like their younger self even after they’ve grown up.
AI voice synthesis paves the way towards virtual immortality. Designers can create voice AIs that emulate real people’s personalities, as they manifest themselves in spoken language.
In fact, at Respeecher we are able to substantiate the claim about voice-based virtual immortality. Using recordings of people (famous or not) who are no longer with us, and state-of-the-art speech-to-speech technology, we can give them a new voice. To put it differently, we can extend people’s presence beyond their lifetime by allowing them to say words that they never said.