Audio Search Accuracy: Can You Wreck A Nice Beach?
In my last post, we looked at accuracy as a necessary trade-off between precision and recall. Then we explored some of the variables that exist in audio discovery that make it quite different from text discovery. This leads us to the next important issue in determining the level of accuracy you can achieve with audio search.
What audio search methodology are you using?
We’ll consider the two approaches that involve computer technology here, and ignore the tried and true “human listening”—which, by the way, could actually be the LEAST ACCURATE of all, but we’ll leave that topic for another day.
The technology of audio discovery is not unlike that of text search at its most basic level. Any search engine has to first create an index of the content, and then provide a means for users to search these indexes and pull back results. The true key to accuracy in audio search lies in how these indexes are created, and there are two fundamentally different approaches.
Most people are at least somewhat aware of Speech Recognition. If you’ve played around with Dragon Dictation or seen any type of automatic text generation tool, you’ve seen it. The official name for this technology is Large Vocabulary Continuous Speech Recognition, or LVCSR. Most commonly it’s known as speech-to-text.
There are some systems that will run this process against large bodies of audio content and produce text indexes that can then be searched just like any other electronic documents. “That’s great!” you may say. “I can search audio just like I would anything else.”
That would be true, if the technology allowed for a perfect translation of the spoken word into textual content. Unfortunately, even with more than 50 years of eggheads (including a team of Google-ites) working on the problem, the general state of the technology is such that, even in the best-case scenario, these text documents have a 35% “word error rate,” meaning that 35% of the text is actually NOT what was being said. And that’s in high-quality broadcast content with very clear speakers. When you consider the normal content found in audio discovery, with floor traders using dynamic slang amidst a cacophony of background noises, the word error rate can quickly rise to 50% or higher.
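For readers who want to see where a number like that comes from: word error rate is commonly computed as the number of word substitutions, deletions and insertions needed to turn the machine transcript into what was actually said, divided by the number of words actually spoken. Here is a minimal sketch of that calculation. The function name and the sample sentences are my own illustrations, not taken from any particular product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    reference  -- what was actually said
    hypothesis -- what the speech-to-text system wrote down
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed what you might expect: comparing the spoken phrase "can you recognize speech" against the transcript "can you wreck a nice beach" counts four errors against four spoken words.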
Look at the title of this blog post again: Can You Wreck A Nice Beach? Sound familiar? Say it to yourself quickly a few times, and I think you’ll get it. This is an actual translation from a speech-to-text system, and it showcases the difficulty of creating an automated translation that faithfully represents the spoken content in the recording.
The other approach uses something called “phonetic indexing and search.” To understand how this works, you need to know what a phoneme (fo’-neem) is. Phonemes are the smallest units of speech, the individual sounds that we string together to make words, phrases and sometimes embarrassing speeches!
In a phonetic indexing system, the software analyzes the audio and, instead of laying down text, it actually creates a time-aligned, phonetic representation of the content. It is capturing all the discrete spoken sounds that are used, and here’s the key—it’s not throwing anything out! Unlike a speech-to-text system, which makes bets along the way as to what words are being spoken (and loses that bet quite often), a phonetic index has captured ALL the original content and made it available for search.
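To give a feel for the idea, here is a toy sketch of a time-aligned phonetic index and a search over it. Everything in it is an assumption made for illustration: the pronunciations are hand-written ARPAbet-style approximations, the fixed 0.08 seconds per phoneme is invented, and the position-by-position scoring is a deliberately naive stand-in for whatever a real phonetic engine does.

```python
from typing import List, Tuple

# Hypothetical, hand-written pronunciations (ARPAbet-style symbols).
PRONUNCIATIONS = {
    "recognize": ["R", "EH", "K", "AH", "G", "N", "AY", "Z"],
    "speech": ["S", "P", "IY", "CH"],
    "beach": ["B", "IY", "CH"],
}

# Each index entry is (phoneme, start time in seconds).
PhoneticIndex = List[Tuple[str, float]]


def build_index(phonemes: List[str], seconds_per_phoneme: float = 0.08) -> PhoneticIndex:
    """Attach a start time to every phoneme; nothing is discarded."""
    return [(p, round(i * seconds_per_phoneme, 2)) for i, p in enumerate(phonemes)]


def search(index: PhoneticIndex, term: str, min_score: float = 1.0) -> List[Tuple[float, float]]:
    """Slide the term's phonemes along the index and score each position.

    Returns (start_time, score) for every window whose fraction of
    matching phonemes is at least min_score.
    """
    query = PRONUNCIATIONS[term]
    hits = []
    for start in range(len(index) - len(query) + 1):
        window = index[start:start + len(query)]
        matches = sum(1 for (phoneme, _), q in zip(window, query) if phoneme == q)
        score = matches / len(query)
        if score >= min_score:
            hits.append((window[0][1], score))
    return hits


# Index the sounds of "recognize speech", then search by ordinary spelling.
index = build_index(PRONUNCIATIONS["recognize"] + PRONUNCIATIONS["speech"])
exact_hits = search(index, "speech")                  # only perfect matches
loose_hits = search(index, "beach", min_score=0.6)    # tolerate near misses
```

Lowering `min_score` is the toy version of trading precision for recall: the search for "beach" finds nothing at the default threshold, but a looser threshold surfaces the similar-sounding end of "speech" as a candidate for a human to review.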
The second part of the system then provides a standard user interface with which legal reviewers can search these phonetic indexes just like they would search any other type of content. Reviewers can enter search terms as they’re normally spelled, and use Boolean and time-based proximity searches to create structured queries and get the most relevant results. And a highly evolved phonetic searching system will even give users the ability to make their own decisions about precision vs. recall; in the legal market, this typically means favoring recall in order to find even the most challenging results.
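A time-based proximity search is easy to picture once every hit carries a timestamp. The sketch below assumes the search step has already produced hit times per recording and per term; the data layout, the function name `near` and the sample recordings are all hypothetical, not the interface of any real product.

```python
from typing import Dict, List, Tuple

# Hypothetical search results: recording id -> {term: [hit times in seconds]}
Hits = Dict[str, Dict[str, List[float]]]


def near(hits: Hits, term_a: str, term_b: str, max_gap_seconds: float) -> List[Tuple[str, float, float]]:
    """Find recordings where both terms are spoken within max_gap_seconds.

    This is an AND query with a time window: a recording only qualifies
    if it contains both terms, close enough together in time.
    """
    results = []
    for recording, by_term in hits.items():
        for time_a in by_term.get(term_a, []):
            for time_b in by_term.get(term_b, []):
                if abs(time_a - time_b) <= max_gap_seconds:
                    results.append((recording, time_a, time_b))
    return results


sample_hits: Hits = {
    "call_001.wav": {"sell": [12.4, 95.0], "shares": [14.1]},
    "call_002.wav": {"sell": [30.0]},
    "call_003.wav": {"sell": [5.0], "shares": [61.5]},
}

# "sell" AND "shares" within 5 seconds of each other
matches = near(sample_hits, "sell", "shares", max_gap_seconds=5.0)
```

Only the first recording qualifies here: the second never mentions "shares," and in the third the two words are almost a minute apart.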
In the short space of two blog entries, it’s impossible to cover ALL the relevant details around this topic of accuracy in audio search. For example, some might notice a bias in this entry toward the phonetic indexing approach. Guilty as charged! But that’s why we allow comments, so I welcome other people’s thoughts on this topic…post ‘em if you’ve got ‘em!


