#ATTDevSummit: Random House playing with AT&T Speech API to eliminate human Audiobook Voice Over recording

Disclaimer: My primary job when I’m not tech blogging is commercial, animation, and video game voice over casting in Hollywood.

One interesting side discussion during this year’s Developer Summit was the use of new speech APIs from AT&T. They’re built on AT&T’s own Watson speech engine, which shares a name with, but is a separate system from, the IBM Watson you heard as the HAL-like voice which spanked a couple of flesh bags on Jeopardy back in 2011. The API is still in alpha, but it’s already light-years ahead of the basic text-to-speech engines in use for audiobooks.

As a voice over director and producer, I completely understand some of the challenges in recording with people. It’s an endurance match recording a 30-second TV spot, let alone a whole book. Still, I’ve always been shocked by people who listen to books recorded by artificial voices, devoid of any performance and with tragically awful emphasis. Those older speech engines will say all the words in the right order, but they can’t tell you a story.

This new speech engine is working to change that. It’s still wholly artificial, but it can now represent several different characters instead of just one voice type. Plus it can be programmed to follow punctuation and energy levels for urgency and emphasis. It makes the act of listening to an audiobook a lot easier when there’s some sense of through line or narrative.
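For the curious, here is a rough sketch of what "programming" a performance might look like. It builds a snippet of SSML, the W3C markup that many text-to-speech engines accept for choosing voices and marking emphasis and pacing. To be clear, this is my own illustration and not what was shown in the demo: the voice names (`narrator`, `hero`) and the `build_ssml` helper are made up, and whether AT&T's alpha API honors these exact tags is an assumption.

```python
import xml.etree.ElementTree as ET


def build_ssml(lines):
    """Build an SSML document from (voice, text, emphasis, rate) tuples.

    voice    -- hypothetical voice name, one per character
    emphasis -- SSML emphasis level such as "strong", or None
    rate     -- SSML prosody rate such as "fast", or None
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    for voice_name, text, emphasis, rate in lines:
        # Each character gets its own <voice> element.
        target = ET.SubElement(speak, "voice", {"name": voice_name})
        # Optional pacing, for urgency.
        if rate:
            target = ET.SubElement(target, "prosody", {"rate": rate})
        # Optional stress on the line.
        if emphasis:
            target = ET.SubElement(target, "emphasis", {"level": emphasis})
        target.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    script = [
        ("narrator", "The hallway was silent.", None, None),
        ("hero", "Run!", "strong", "fast"),
    ]
    print(build_ssml(script))
```

The point is that the "performance" lives in the markup: someone still has to decide which lines are urgent and which character is speaking, which starts to look a lot like directing.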

It’s obviously years away from replacing the terrific actors who labor over these kinds of projects, but computer voices are improving rapidly. We all make jokes about Siri, but Google and Apple have delivered mobile data assistants with voices we would’ve thought impossible at the consumer level even just five years ago. And then there’s IBM’s Watson, which can even learn the nuance of language well enough to pick up on swearing and other colorful metaphors.

Seth Stell from Smashing Ideas was on hand to demo some of the work they’re doing with Random House to replace humans. I shot video of the demonstration, but unfortunately the Galaxy S4 Zoom I used ate the audio, which is like the most important part of demoing a speech engine. Thankfully there’s a Livestream of the event embedded below.
Skip to 22 minutes to begin the piece on Speech hosted by Random House.