I thought I’d get back to the accuracy question again, and go into a bit more detail on how we determine the overall accuracy of a phonetic search model based on the optimum trade-off between precision and recall. It all boils down to understanding the DET chart, or Detection Error Tradeoff chart. Here’s what one of these looks like:
In most charts, “up and to the right” is the way you want to go. A DET chart is somewhat flipped from this paradigm: “down and to the left” would represent a perfect world. But as we’ve discussed before, there’s no perfect world in search…it’s all about trade-offs. So let’s dive into the details on this chart so you can understand it better. First, what does this chart really show?
This chart shows the practical search results for five different search expressions in a typical Nexidia search. Each search expression is made up of a certain number of phonemes. The shortest expression (fewest phonemes) is shown at the top in the orange line, while the longest expression (most phonemes) is shown in pink at the bottom. The Y-axis measures the percent recall for the search, while the X-axis measures the level of precision for the search. (For a refresher on precision vs. recall, view my earlier post here.)
So what is this chart showing us? It is a dramatic, real-world illustration that for any given search expression, you can maximize recall (the most potential true positives) but only at the expense of precision (more false hits). The yellow line represents a search term with 8 phonemes, a typical two-to-three-syllable word. Following this line all the way down to the right, you see that you can achieve almost 90 percent recall if you are willing to live with about 10 false alarms per hour of content. That’s not a bad trade-off in a compliance situation, especially when the review tool lets you quickly and easily listen to and disposition results.
As with most any type of search engine, the more relevant content you give it to search on, the better your results. So in this case, the bottom pink line represents a search of 20 phonemes (a typical three- or four-word phrase) and shows that you can get over 95% recall with just one false alarm per hour, and almost 99% recall with only 10 false alarms per hour.
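The trade-off the DET chart plots can be sketched in code: each candidate hit carries a confidence score, and lowering the detection threshold raises recall while also letting in more false alarms per hour. This is a minimal sketch; the scores, labels, and hour count below are invented for illustration, not taken from the chart:

```python
# Hypothetical scored search results: (confidence, is_true_hit),
# over an assumed two hours of searched audio.
results = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
           (0.70, False), (0.65, True), (0.40, False), (0.30, False)]
hours_of_audio = 2.0
total_true = sum(1 for _, ok in results if ok)

def operating_point(threshold):
    """Recall and false alarms/hour if we keep hits scoring >= threshold."""
    kept = [(s, ok) for s, ok in results if s >= threshold]
    recall = sum(1 for _, ok in kept if ok) / total_true
    false_alarms_per_hour = sum(1 for _, ok in kept if not ok) / hours_of_audio
    return recall, false_alarms_per_hour

# A strict threshold finds half the true hits with no false alarms;
# a looser one finds them all, at the cost of false alarms.
print(operating_point(0.9))  # (0.5, 0.0)
print(operating_point(0.6))  # (1.0, 1.0)
```

Each threshold is one point on a line of the DET chart; sweeping the threshold traces out the whole curve.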
There are two key points that I will make again. First, because the underlying phonetic index has captured ALL the true spoken content in each recording, it offers the most accurate representation possible of what people have actually said in the file. But second, because of the many variables that make up the differences we experience in human speech (accents, background noise, etc.), reviewers need to leverage this knowledge about precision vs. recall to craft a search strategy that gives them the level of search results that satisfies their goals.
I often find myself promoting the fact that Nexidia supports more than 35 languages worldwide, including different “language packs” for both American and British English. People wonder why we bother with this; aren’t they essentially the same language? And since Nexidia is capturing the phonemes, why can’t we just have a standard English language pack and be done with it?
Well, Yanks and Brits can certainly understand each other (for the most part) on each side of The Pond. But that’s because the human brain has an amazing ability to adapt and recognize patterns and nuances and put things into context on the fly. So when an American says “aluminum” but a Brit says “aluminium” most people realize right away they mean the same thing. But these two words do sound different, especially when you factor in the vastly different accents and dialects across the UK. So the reason we have two different language packs for essentially the same language goes back to this: we need to accurately capture the sounds made by speakers of each language, and we need to support the search and retrieval of those sounds using the common text expressions that represent those words and phrases.
Here’s a classic illustration. Let’s ponder the word “advertisement”. It’s spelled the same in both the US and the UK (and Canada…let’s not forget our northern neighbors). But it’s pronounced quite differently.
In the US, it’s ad-ver-TISE-ment.
In the UK, it’s ad-VER-tiz-ment.
So in order to provide the most accurate search possible, the Nexidia engine first captures the spoken sounds (phonemes) that are used to represent this word in a recording. Then, when the user enters the text expression to search, we convert this text back into the appropriate sounds that are representative for the accents and dialects for a particular language and find all the matches. In the North American English language pack, we know to look for ad-ver-TISE-ment, while in the UK English language pack we look for ad-VER-tiz-ment.
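That text-to-phonemes step can be pictured as a per-language-pack pronunciation lookup. This is only a sketch: the pack names, the dictionary, and the ARPAbet-style phoneme labels are invented for illustration and are not Nexidia’s actual phoneme set or API.

```python
# Hypothetical pronunciation dictionaries, one per language pack.
PRONUNCIATIONS = {
    "en-US": {"advertisement": ["AE D", "V ER", "T AY Z", "M AH N T"]},
    "en-GB": {"advertisement": ["AH D", "V ER T", "IH S", "M AH N T"]},
}

def text_to_phonemes(term, language_pack):
    """Convert a text query into the phoneme sequence to search for."""
    syllables = PRONUNCIATIONS[language_pack][term.lower()]
    return " ".join(syllables)

# The same spelling maps to different sounds in each pack:
print(text_to_phonemes("advertisement", "en-US"))
print(text_to_phonemes("advertisement", "en-GB"))
```

The point of the design is that the user types one spelling, and the engine picks the sounds appropriate to the accent being searched.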
I haven’t even touched on the fact that we have yet another English language pack for our Aussie mates (or should I say “Ozzie mites”?). I suspect that Down Under, the word for advertisement is “Fosters,” beer being the most popular consumer product. (And yes, I know that Fosters isn’t actually popular in-country…but if I said “Four X” or “Tooheys” the rest of the world wouldn’t get my joke.)
This was obviously just one example of the literally hundreds of thousands of permutations and differences that exist even between what are essentially the same language. But it helps you better understand the work Nexidia has put in to make sure that this is all transparent to the end user. With that, I’m off to pop open a bottle of Bud, put some prawns on the barbie and settle in to watch some soccer…I mean, football!
In my last post, we looked at accuracy as a necessary trade-off between precision and recall. Then we explored some of the variables that exist in audio discovery that make it quite different from text discovery. This leads us to the next important issue in determining the level of accuracy you can achieve with audio search.
What audio search methodology are you using?
We’ll consider the two approaches that involve computer technology here, and ignore the tried and true “human listening”—which, by the way, could actually be the LEAST ACCURATE of all, but we’ll leave that topic for another day.
The technology of audio discovery is not unlike that of text search at its most basic level. Any search engine has to first create an index of the content, and then provide a means for users to search these indexes and pull back results. The true key to accuracy in audio search lies in how these indexes are created, and there are two fundamentally different approaches.
Most people are at least somewhat aware of Speech Recognition. If you’ve played around with Dragon Dictation or seen any type of automatic text generation tool, you’ve seen it. The official name for this technology is Large Vocabulary Continuous Speech Recognition, or LVCSR. Most commonly it’s known as speech-to-text.
There are some systems that will run this process against large bodies of audio content and produce text indexes that can then be searched just like any other electronic documents. “That’s great!” you may say. “I can search audio just like I would anything else.”
That would be true, if the technology allowed for a perfect translation of the spoken word into textual content. Unfortunately, even with more than 50 years of eggheads (including a team of Google-ites) working on the problem, the general state of the technology is such that, in the best-case scenario, these text documents have a 35% “word error rate,” meaning that 35% of the text is actually NOT what was being said. And that’s in high-quality broadcast content with very clear speakers. When you consider the normal content found in audio discovery, with floor traders using dynamic slang amidst a cacophony of background noises, the word error rate can quickly rise to 50% or higher.
Look at the title of this blog post again: Can You Wreck A Nice Beach? Sound familiar? Say it to yourself quickly a few times, and I think you’ll get it. This is an actual translation from a speech-to-text system, and it showcases the difficulty of creating an automated translation that faithfully represents the spoken content in the recording.
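Word error rate itself is a standard, well-defined measure: the minimum number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the reference length. A minimal sketch, using the title of this post as the botched transcript:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substitutions plus two insertions against a two-word reference:
print(word_error_rate("recognize speech", "wreck a nice beach"))  # 2.0
```

Note that WER can exceed 100% when the transcript inserts extra words, as it does here.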
The other approach uses something called “phonetic indexing and search.” To understand how this works, you need to know what a phoneme (fo’-neem) is. Phonemes are the smallest parts of speech, the individual sounds that we string together to make words, phrases and sometimes embarrassing speeches!
In a phonetic indexing system, the software analyzes the audio and, instead of laying down text, it actually creates a time-aligned, phonetic representation of the content. It is capturing all the discrete spoken sounds that are used, and here’s the key—it’s not throwing anything out! Unlike a speech-to-text system, which makes bets along the way as to what words are being spoken (and loses that bet quite often), a phonetic index has captured ALL the original content and made it available for search.
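A time-aligned phonetic index can be pictured as a list of (timestamp, phoneme) pairs, with search reduced to finding the query’s phoneme sequence inside it. This toy does exact matching only; a real phonetic search engine scores approximate matches (which is where the precision/recall dial comes from), and the phoneme labels here are invented for illustration:

```python
# A toy time-aligned phonetic index: one (seconds, phoneme) pair per
# detected sound, e.g. for a recording of "hello ... world".
index = [(0.0, "HH"), (0.1, "EH"), (0.2, "L"), (0.3, "OW"),
         (1.0, "W"), (1.1, "ER"), (1.2, "L"), (1.3, "D")]

def search(query_phonemes):
    """Return the start time of every exact phoneme-sequence match."""
    phones = [p for _, p in index]
    hits = []
    for i in range(len(phones) - len(query_phonemes) + 1):
        if phones[i:i + len(query_phonemes)] == query_phonemes:
            hits.append(index[i][0])
    return hits

print(search(["L", "OW"]))  # [0.2]
```

Because the index keeps every sound with its timestamp, any query can be run after the fact without re-processing the audio.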
The second part of the system then provides a standard user interface with which legal reviewers can search these phonetic indexes just like they would search any other type of content. Reviewers can enter search criteria just like they’re normally spelled, and use BOOLEAN and time-based proximity searches to create structured queries and get the most relevant results. And a highly evolved phonetic searching system will even give users the ability to make their own decisions about precision vs. recall; in the legal market, this typically means favoring recall in order to find even the most challenging results.
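The time-based proximity idea mentioned above can be sketched as well: given hit lists for two terms, keep only the pairs that occur within some window of each other. The terms, timestamps, and function name below are hypothetical:

```python
# Hypothetical search hits: (term, start_time_in_seconds).
hits_a = [("margin call", 12.4), ("margin call", 87.0)]
hits_b = [("account", 14.1), ("account", 300.5)]

def near(first, second, window_seconds):
    """Pairs of start times where the two terms fall within the window."""
    return [(a, b) for _, a in first for _, b in second
            if abs(a - b) <= window_seconds]

print(near(hits_a, hits_b, 5.0))  # [(12.4, 14.1)]
```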
In the short space of two blog entries, it’s impossible to cover ALL the relevant details around this topic of accuracy in audio search. For example, some might notice a bias in this entry toward the phonetic indexing approach. Guilty as charged! But that’s why we allow comments, so I welcome other people’s thoughts on this topic…post ‘em if you’ve got ‘em!
The most common question I get when introducing people to audio discovery is this: how accurate is the system? It’s an understandable question…people want a generally good sense that they can find what they’re searching for. But as with many things in life, the answer is…
And it depends on several things. How do you measure “accuracy” and what are your comparisons? What is the source of the audio and who are the speakers involved? What audio search methodology are you using, and how are you executing your search criteria? All these elements will impact the answer to “how accurate is it?” Let’s parse through them a bit and I’ll explain.
First off, what does “accuracy” really mean? Most search technologists will tell you that accuracy is a trade-off between precision and recall. A search that is 100% precise will yield only hits that are exactly what you’re looking for, aka “true positives.” A search that has 100% recall will yield every single true positive in the content that you’ve searched, but may yield a few (or billions of!) “false positives” that you also have to wade through.
In this context, a perfectly accurate search would yield 100% of all the good hits in your content without injecting any false hits along the way. That’s nirvana, utopia, heaven…call it what you will. But in the words of any self-respecting Mainer: “ya cahn’t get theah from heah!”
There are exceptions to every rule, but a 100% accurate search in any large body of content isn’t practical. So the goal is to optimize the trade-off between precision and recall, such that you are getting AS MANY AS POSSIBLE of the good hits while experiencing an ACCEPTABLE LEVEL of false positives. With that understanding of accuracy in general, let’s address the next most important question:
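The precision and recall definitions above are easy to make concrete. A minimal sketch, with made-up call identifiers standing in for search results:

```python
def precision_recall(hits, truth):
    """Precision and recall for one search.

    hits  -- identifiers the search returned
    truth -- identifiers that are genuinely relevant (the true positives)
    """
    hits, truth = set(hits), set(truth)
    true_pos = hits & truth
    precision = len(true_pos) / len(hits) if hits else 1.0
    recall = len(true_pos) / len(truth) if truth else 1.0
    return precision, recall

# 8 of the 10 returned hits are relevant, out of 16 relevant calls total:
hits = [f"call{i}" for i in range(10)]
truth = [f"call{i}" for i in range(8)] + [f"call{i}" for i in range(20, 28)]
p, r = precision_recall(hits, truth)
print(p, r)  # 0.8 0.5
```

Here the search is fairly precise (80% of what it returned was good) but misses half of what was actually there, which is exactly the tension the rest of this post is about.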
What factors make audio more difficult to search than text?
Unlike text content, which tends to be more black and white (pardon the pun), audio content comes at you with many more shades of grey that must be factored into the search process. There are the common ones that people know to consider, such as accents and language differences. You know, over here we say “mustard” while across The Pond they say “Grey Pou-Pon!”
But beyond these obvious differences lie more subtle ones that can be much more insidious. Text content is not subject to extreme background noise as you might find in a typical trading floor environment. Likewise, text created on a Mac is pretty much the same as that created on a PC, whereas audio content can be created by fifty or more different types of recording devices, each with its own compression scheme and encoding characteristics that will all affect the quality of the spoken content and could throw off your search results.
In addition, there are often multiple ways to say something verbally that would have only a single common text expression. Good examples of this are numbers and acronyms. The text “225” might be spoken as “two two five,” “two twenty five” or even “two hundred twenty five.” And the acronym NCAA is commonly spoken as “N C double A” or “N C two A”.
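One common way to handle this is to expand a text query into its spoken variants before the phonetic search runs, so each form is matched. The table and function below are a hypothetical sketch of that idea, not an actual Nexidia interface:

```python
# Hypothetical table of common spoken forms for text expressions.
SPOKEN_FORMS = {
    "225": ["two two five", "two twenty five", "two hundred twenty five"],
    "NCAA": ["N C double A", "N C two A"],
}

def spoken_variants(term):
    """Spoken forms to search for, falling back to the term itself."""
    return SPOKEN_FORMS.get(term, [term])

print(spoken_variants("225"))
print(spoken_variants("NCAA"))
```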
So as you embark on an audio discovery project, you have to consider all these elements and make sure you use a methodology that will address them effectively so you can achieve the level of accuracy that you need. Which leads to the third question:
What audio search methodology are you using?
In my next post, we’ll go into more detail on both traditional and modern search techniques so you can judge for yourself what works best for your projects.