Get Adobe Flash player

Audio Search Accuracy: Can You Wreck A Nice Beach?

In my last post, we looked at accuracy as a necessary trade-off between precision and recall. Then we explored some of the variables that exist in audio discovery that make it quite different from text discovery.  This leads us to the next important issue in determining the level of accuracy you can achieve with audio search.

What audio search methodology are you using?

We’ll consider the two approaches that involve computer technology here, and ignore the tried and true “human listening”—which, by the way, could actually be the LEAST ACCURATE of all, but we’ll leave that topic for another day.

The technology of audio discovery is not unlike that of text search at its most basic level. Any search engine has to first create an index of the content, and then provide a means for users to search these indexes and pull back results.  The true key to accuracy in audio search lies in how these indexes are created, and there are two fundamentally different approaches.

Most people are at least somewhat aware of Speech Recognition.  If you’ve played around with Dragon Dictation or seen any type of automatic text generation tool, you’ve seen it. The official name for this technology is Large Vocabulary Continuous Speech Recognition, or LVSCR. Most commonly it’s known as speech-to-text.

There are some systems that will run this process against large bodies of audio content and product text indexes that can then be searched just like any other electronic documents. “That’s great!” you may say. “I can search audio just like I would anything else.”

That would be true, if the technology allowed for a perfect translation of the spoken word into textual content. Unfortunately, even with more than 50 years of eggheads (including a team of Google-ites) working on the problem, the general state of the technology is such in the best case scenario, these text documents contain 35% “word error rate,” meaning that 35% of the text is actually NOT what was being said. And that’s in high quality broadcast content with very clear speakers. When you consider the normal content found in audio discovery, with floor traders using dynamic slang amidst a cacophony of background noises, the word error rate can quickly rise to 50% or higher.

Look at the title of this blog post again: Can You Wreck A Nice Beach?  Sound familiar? Say it to yourself quickly a few times, and I think you’ll get it. This is an actual translation from a speech-to-text system, and it showcases the difficulty of creating an automated translation that faithfully represents the spoken content in the recording.

The other approach uses something called “phonetic indexing and search.” To understand how this works, you need to know what a phoneme (fo’-neem) is.  Phonemes are the smallest parts of speech, the individual sounds that we string together to make words, phrases and sometimes embarrassing speeches!

In a phonetic indexing system, the software analyzes the audio and, instead of laying down text, it actually creates a time-aligned, phonetic representation of the content. It is capturing all the discrete spoken sounds that are used, and here’s the key—it’s not throwing anything out!  Unlike a speech-to-text system, which makes bets along the way as to what words are being spoken (and loses that bet quite often), a phonetic index has captured ALL the original content and made it available for search.

The second part of the system then provides a standard user interface with which legal reviewers can search these phonetic indexes just like they would search any other type of content. Reviewers can enter search criteria just like they’re normally spelled, and use BOOLEAN and time-based proximity searches to create structured queries and get the most relevant results.  And a highly evolved phonetic searching system will even give users the ability to make their own decisions about precision vs. recall; in the legal market, this typically means favoring recall in order to find even the most challenging results.

In the short space of two blog entries, it’s impossible to cover ALL the relevant details around this topic of accuracy in audio search. For example, some might notice a bias in this entry toward the phonetic indexing approach. Guilty as charged! But that’s why we allow comments, so I welcome other people’s thoughts on this topic…post ‘em if you’ve got ‘em!

Print Friendly

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

About
Can you HEAR me now?

When it comes to audio evidence, the answer is oftentimes “NO!”

And this is unfortunate, because audio evidence (or “sound recordings” as the FRCP likes to say) are becoming a critical source of discovery content in both regulatory and litigation matters. So the purpose of this blog is to help you learn what Audio Discovery is all about and how to do it in the most efficient and cost-effective ways.

As your Bloggist, I bring 20+ years of experience in audio technologies to the table, first in the old Ma Bell system and then later with companies like Cingular Wireless and now Nexidia. So I’ve witnessed first-hand many of the revolutions in digital audio that are now dramatically changing how you manage this important discovery component. In this blog, I will help you navigate these .WAVs so you can be an audio expert too. And if you didn’t get that pun, even more reason to come back often!

Jeff Schlueter
VP/GM, Legal Markets
Nexidia