AI voice-cloning to help my sick friend speak again

Adam in the hospital, full of fighting spirit

Adam Lopez is 52 years old and lives in Toronto, Canada, with his wife, daughter and pet dog Ringo. Adam has been my best friend for 35 years.

In October 2021, he developed a sore throat. It persisted for a few weeks, so he decided to go to the doctor’s. What he thought was a fleeting case of strep turned out to be stage 4 head and throat cancer, and he was given a terminal diagnosis, with less than a year to live.

Adam doesn’t take bad news like this lying down, so he set out to beat the odds and defeat his affliction. Through treatments, hacks and extensive research he fought off the disease, and three years later he is still here. Sadly, the cancer has taken hold and his tumours have grown. Nerves in his face have been damaged and his throat has been badly affected, leaving him with difficulty speaking and just a whisper of his former voice.

I visited him in Toronto recently and could see the struggle he was having communicating with his family and friends. Not only had the cancer taken away his means of verbal expression, but not having his voice had also taken away part of himself.

In 2021, I worked on video material for Critical Analysis, a first-year undergraduate module at Bayes Business School. New material was made, but some legacy videos needed to be updated with new commentary to reflect the changed name of the business school, as well as introducing new lecturers. It would have been impossible for the updated material to match the acoustics of the old. With this in mind, we used a system called Descript to clone the narrator’s voice and typed in what we wanted her to say. The results were indistinguishable from the original voice and the acoustics matched exactly.

Original audio

The audio with AI changes, which was then de-noised for publication

With my experience at Bayes and with Professor Stephen Hawking’s Speak ‘n’ Spell machine in mind, I set out to help Adam and his family. After lots of research I settled on a trial of the AI voice software from ElevenLabs. You upload a voice sample, and from that a voice print is made. Using a text editor, the user types what they’d like to say, and it is spoken using the supplied voice. Several sliders allow the user to change the consistency of the dialogue, so it ranges from a regular, even delivery to one which varies in tone and expression.

The secret to a great result is supplying a lengthy, high-quality voice recording for the software to train on. Fortunately, until recently Adam was the director of the world-famous film festival, Toronto After Dark, and he was able to supply me with lots of video clips of him talking about the festival and being interviewed. In total, I had about 1 hour and 30 minutes of material. Sadly, much of it had a lot of background music and noise, and the acoustics weren’t great. Adam was also quite excitable as he enthused about his festival, which would be reflected in his voice print.

An example of Adam’s real voice

I reviewed the video clips and took out the best bits, giving about 45 minutes of pure Adam. The first thing to do was to remove the background sounds and clean up the acoustics. For this, I turned to Adobe Podcast, an AI system for turning voice audio into studio quality dialogue. The results were very impressive. This was then fed into ElevenLabs, where the AI did its magic.

The results were unbelievably good. I typed some random twaddle, pressed the button, and out popped Adam’s voice. It wasn’t just like him. It was him! I knew this was going to work.

Each conversion from text to voice uses a number of credits. The number of credits is set by the subscription you take out. I knew Adam would use some standard phrases on a regular basis and so he wouldn’t need to regularly type things like “Can you take the dog out” or “I’ll be 10 minutes”. For this, I decided to build a web app he could use on his phone, tablet or laptop with a list of common phrases. To say one, all he would need to do is tap on the phrase and the device would say it, in his voice.

Liaising with him and his family, I used ElevenLabs to create a pool of specific phrases, and I built the app over a couple of days. He now uses it to talk to his family.

The AdamTalk app

For larger, more complex conversations and material, he types straight into ElevenLabs, and the material is spoken in his voice. He recently sent me a long review of the Back to the Future Trilogy. Remember, he typed this and the software created the audio. A large proportion of the words said were never said in the original recordings; they were entirely AI generated.

Adam’s Review of the Back to the Future Trilogy

As you can imagine, having his voice and mode of expression has been a game changer.

Issues, dangers and the future

Although the system is incredible, it isn’t perfect. The system was trained on a wide range of material which varied in pacing and emotion. As a result, each time a line of text is converted, it comes out slightly differently. Depending on the sliders, this difference ranges from small to extreme, and sometimes it is difficult to get the correct tone. Capitals, exclamation marks, and … can be used to change delivery and spacing, which is useful. But no matter how much I tried, I could never get Adam to say “Ho Ho Ho!” convincingly like Father Christmas!

If Adam had made a professional recording before he lost his voice, in a consistent manner, then the results would probably have been better. Sadly, we didn’t have that option. The results are still 90% Adam, and the variability does make the AI voice sound more human. However, there is still an element of “uncanny valley” or “robot”. It is still remarkable considering the source material.

You may be asking, what’s to stop someone supplying someone else’s voice to the software and using the AI voice without the speaker’s permission, perhaps for illegal activity? ElevenLabs and many of the AI systems require authorisation and permission, usually by having the speaker read a piece of legal text aloud and checking that the voice matches. In the case of Adam, he couldn’t do that. After I spoke to ElevenLabs, they sent out some forms for identification purposes, Adam filled them in, and the system was activated. After hearing about my friend’s situation, they generously allowed him free use, which is a wonderful gesture.

The pace of development of AI is quite scary, and the results of AI images and sound are becoming more realistic and believable by the month. One could envision one day a system which could create a digital version of a person, with not only their voice but their entire personality and memory too. There are dangers and worries but, in this case, it has proved to be a useful tool to help my sick friend communicate with his friends and family in a natural way and keep his personality on show for all to hear.
