We’ve been involved in a couple of projects recently where there is confusion about file formats and compatibility with Windows Media Player (WMP). Specifically, one of the most common types of digital audio file extensions is the “.wav” extension, which is GENERALLY playable in all the standard media players like WMP, iTunes, RealPlayer, etc. The .wav extension has become more or less a “standard” if such a thing exists in digital audio.
But the truth of the matter is quite a bit more complex, and can make your projects more challenging if you are not aware of it. In reality, the “.wav” extension merely defines a “wrapper” for an audio file, and tells whatever media player you are trying to use certain information about that file. This information can vary considerably, but the most important thing it tells is what codec was used to create the actual digital audio track.
What’s a codec, you ask? Well, that is an industry abbreviation for the “coder-decoder” that was used to capture the analog audio and convert it to a digital signal. Every digital sound file has gone through a transformation, from the original analog sound waves to a series of bits and bytes that are generated to capture that sound wave and then recreate it through the computer. The original capture and digitization of that sound is the coding part, and the subsequent processing of that code to play it back out of a computer (or CD player, DVD player, etc.) is the decoding part. And there are many different codecs that can be used to accomplish this task.
One of the oldest and most common codecs is PCM (pulse code modulation), which is the standard for capturing and playing back audio from computers, CDs, etc. But while PCM can produce a nicely accurate rendition of the original sound, it is uncompressed, which means the digital files it creates are quite large. To solve the problem of file size while still retaining fidelity in different regions of sound, different codecs have sprung up to handle different types of audio. For example, MP3 was developed specifically to retain a higher fidelity of sound for music, but at the expense of fidelity in the sound ranges represented by normal speech. Similarly, the telephony industry has developed codecs that maintain an adequate representation of normal spoken content but reduce the fidelity of other parts of the sound range to save file space. (The more educated reader will recognize that I am glossing over the issue of sampling and its effect on file size. For those who are interested, go to this Wikipedia page.)
But here is the crux of the matter, and the trouble you may run into with an audio discovery project. The “.wav” envelope can contain audio encoded with any number of different codecs, and these codecs are not all playable in WMP…or any other standard player. The reason is that not all codecs are freely licensed. Some codec developers have put restrictions on their creations, requiring that anyone wanting to play files in that format have a license to do so. And since Microsoft (and Apple, and Real) don’t feel like paying millions of dollars to these developers for each copy of WMP (iTunes, RealPlayer, etc.) they distribute, these codecs are not recognized by those media players. And thus, you have a .wav file that you may not be able to play.
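If you're curious which codec a mysterious .wav actually contains, the answer is sitting in the file's “fmt ” chunk as a numeric format tag. Here's a minimal sketch in Python that walks the RIFF chunks and reads it. The PCM, A-law, and mu-law codes are the standard WAVE format tags; the TrueSpeech mapping is the commonly reported value and should be treated as an assumption:

```python
import struct

# Common wFormatTag codes. PCM, A-law, and mu-law are standard; 0x0022 is
# the tag commonly reported for DSP Group TrueSpeech (assumption here).
FORMAT_TAGS = {
    0x0001: "PCM",
    0x0006: "A-law",
    0x0007: "mu-law",
    0x0022: "DSP Group TrueSpeech",
}

def wav_format_tag(data: bytes) -> int:
    """Walk the RIFF chunks of a .wav file and return its wFormatTag."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            # wFormatTag is the first 16-bit field of the fmt chunk body
            (tag,) = struct.unpack_from("<H", data, pos + 8)
            return tag
        pos += 8 + size + (size & 1)  # chunks are word-aligned
    raise ValueError("no fmt chunk found")
```

Running this on one of those unplayable files will typically turn up a tag outside the PCM family, which is your cue to line up conversion before review or production.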
Since audio discovery is most commonly about reviewing telephone calls, you are likely to run into two of the codecs that have become popular in telephony, due to their file compression and voice fidelity characteristics: TrueSpeech and G.729. And while these files will commonly come with a .wav extension, you will get a nice little error message from WMP when you try to play them. Two things to keep in mind when you are dealing with files of this nature:
- You must make sure your audio discovery vendor has the ability to deal with them natively so that you can accomplish your review; and,
- You must also make sure that your vendor can convert any of these files that you need to produce to regulators or opposing counsel. Otherwise, you are likely to get a pretty nasty response (or worse!) back from them if you’ve delivered files they can’t play.
If you would like to know more about file formats, or learn about Nexidia’s ability to handle these (and just about any other) formats then please visit our website or reach out to me directly. We can make sure you can surf the .wavs without wiping out.
I thought I’d get back to the accuracy question again, and go into a bit more detail on how we determine the overall accuracy of a phonetic search model based on the optimum trade-off between precision and recall. It all boils down to understanding the DET chart, or Detection Error Tradeoff chart. Here’s what one of these looks like:
In most charts, “up and to the right” is the way you want to go. A DET is somewhat flipped from this paradigm, where “down and to the left” would represent a perfect world. But as we’ve discussed before, there’s no perfect world in search…it’s all about trade-offs. So let’s dive into the details on this chart so you can understand it better. First, what does this chart really show?
This chart shows the practical search results for five different search expressions in a typical Nexidia search. Each search expression is made up of a certain number of phonemes. The shortest expression (fewest phonemes) is shown at the top in the orange line, while the longest expression (most phonemes) is shown in pink at the bottom. The Y-axis measures the percent recall for the search, while the X-axis measures the level of precision for the search. (For a refresher on precision vs. recall, view my earlier post here.)
So what is this chart showing us? It is a dramatic, real-world demonstration that for any given search expression, you can maximize recall (the most potential true positives) but only at the expense of precision (more false hits). The yellow line represents a search term with 8 phonemes, a typical two-to-three-syllable word. Following this line all the way down to the right, you see that you can achieve almost 90 percent recall if you are willing to live with about 10 false alarms per hour of content. That’s not a bad trade-off in a compliance situation, especially when the review tool lets you quickly and easily listen to and disposition results.
As with most any type of search engine you use, the more relevant content you give it to search, the better your results. So in this case, the bottom pink line represents a search of 20 phonemes (a typical three or four word phrase) and shows that you can get over 95% recall with just one false alarm per hour, and almost 99% recall with only 10 false alarms per hour.
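The trade-off those curves trace is easy to reproduce with a toy calculation. In this sketch, the hit scores, labels, and hours of content are entirely made up for illustration; the point is just that sweeping a confidence threshold generates the recall vs. false-alarms-per-hour pairs a DET chart plots:

```python
# Invented example data: (confidence score, is_true_positive) for each hit.
hits = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.70, False), (0.60, True), (0.50, False), (0.40, False),
]
total_relevant = 5     # true mentions known to exist in the audio
hours_of_audio = 2.0   # total content searched

def operating_point(threshold):
    """Recall and false alarms per hour when keeping hits above threshold."""
    kept = [h for h in hits if h[0] >= threshold]
    true_pos = sum(1 for _, is_true in kept if is_true)
    false_pos = len(kept) - true_pos
    return true_pos / total_relevant, false_pos / hours_of_audio

for t in (0.9, 0.7, 0.5):
    recall, fa = operating_point(t)
    print(f"threshold {t}: recall {recall:.0%}, {fa:.1f} false alarms/hour")
```

Lowering the threshold moves you down the curve: recall climbs, but so does the false-alarm rate, exactly as the chart shows.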
There are two key points that I will make again. First, because the underlying phonetic index has captured ALL the true spoken content in each recording, it offers the most accurate representation possible of what people have actually said in the file. But second, due to the many variables that make up the differences we experience in human speech (accents, background noise, etc.), reviewers can leverage this knowledge about precision vs. recall to craft a search strategy that gives them the level of search results that satisfy their goals.
I often find myself promoting the fact that Nexidia supports more than 35 languages worldwide, including different “language packs” for both American and British English. People wonder why we bother with this: aren’t they essentially the same language? And since Nexidia is capturing the phonemes, why can’t we just have a standard English language pack and be done with it?
Well, Yanks and Brits can certainly understand each other (for the most part) on each side of The Pond. But that’s because the human brain has an amazing ability to adapt and recognize patterns and nuances and put things into context on the fly. So when an American says “aluminum” but a Brit says “aluminium” most people realize right away they mean the same thing. But these two words do sound different, especially when you factor in the vastly different accents and dialects across the UK. So the reason we have two different language packs for essentially the same language goes back to this: we need to accurately capture the sounds made by speakers of each language, and we need to support the search and retrieval of those sounds using the common text expressions that represent those words and phrases.
Here’s a classic illustration. Let’s ponder the word “advertisement”. It’s spelled the same in both the US and the UK (and Canada…let’s not forget our northern neighbors). But it’s pronounced quite differently.
In the US, it’s ad-ver-TISE-ment.
In the UK, it’s ad-VER-tiz-ment.
So in order to provide the most accurate search possible, the Nexidia engine first captures the spoken sounds (phonemes) that are used to represent this word in a recording. Then, when the user enters the text expression to search, we convert this text back into the appropriate sounds that are representative for the accents and dialects for a particular language and find all the matches. In the North American English language pack, we know to look for ad-ver-TISE-ment, while in the UK English language pack we look for ad-VER-tiz-ment.
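Conceptually, that lookup can be pictured as a per-pack pronunciation table. The phoneme labels below are loosely ARPABET-style and invented for this example; they are not Nexidia's actual internal representation:

```python
# Illustrative only: invented phoneme strings, one pronunciation per pack.
PRONUNCIATIONS = {
    "en-US": {"advertisement": ["AE", "D", "V", "ER", "T", "AY", "Z", "M", "AH", "N", "T"]},
    "en-GB": {"advertisement": ["AH", "D", "V", "ER", "T", "IH", "S", "M", "AH", "N", "T"]},
}

def to_phonemes(text, language_pack):
    """Convert a text search expression to the phoneme sequence for one pack."""
    return [phoneme
            for word in text.lower().split()
            for phoneme in PRONUNCIATIONS[language_pack][word]]
```

The same typed query, “advertisement,” yields two different sound sequences to hunt for, which is exactly why the packs are separate.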
I haven’t even touched on the fact that we have yet another English language pack for our Aussie mates (or should I say “Ozzie mites”?). I suspect that Down Under, the word for advertisement is “Fosters,” beer being the most popular consumer product. (And yes, I know that Fosters isn’t actually popular in-country…but if I said “Four X” or “Tooheys” the rest of the world wouldn’t get my joke.)
This was obviously just one example of the literally hundreds of thousands of permutations and differences that exist even between what are essentially the same language. But it helps you better understand the work Nexidia has put in to make sure that this is all transparent to the end user. With that, I’m off to pop open a bottle of Bud, put some prawns on the barbie and settle in to watch some soccer…I mean, football!
We’re involved in several very high profile matters at the moment, each with thousands of hours of audio, some of it in multiple languages. And I sat through a project team meeting today where we were discussing the set of search terms that one of the law firms had developed to start running against these audio files.
What transpired during this meeting is so common I thought I would pass it along. You see, quite understandably, the law firm that developed the search terms took them directly from the same set of terms that had been developed for the email search. But the reality is that people tend to speak very differently than they write, so I spent a good thirty minutes going over the search terms and providing suggestions to shrink the list and make it more realistic.
Confidentiality prevents me from using any of the real terms from this case, but here are some illustrative examples:
- People don’t talk like they “text”. You’ll never (well, seldom) hear someone actually say “LOL” or “TTFN”. (Although I have been known to say “WTF” from time to time!) Granted, these aren’t likely to be meaningful search terms themselves, but other such contractions that may be used in emails between traders will have another spoken equivalent.
- Proper names, especially people’s names, tend to morph quite a bit in spoken form from what you may see in email. Around the office people call me “Der Schlueter” with a really bad German accent. But I can’t remember the last time anybody used either Jeff or Schlueter in an email. Names are often omitted because the recipients are assumed based on the addresses used.
- Certain types of information have only one form in which they would typically appear in text, but could be spoken in many different ways. Numerical data is like this. Somebody may purchase 1,900 shares of a security, but the trader might say “one thousand nine hundred” or “nineteen hundred” which in an audio search are two totally different constructs.
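That last point lends itself to a quick sketch: expanding a numeric search term into its likely spoken forms before searching. The helper below hand-rolls just the two variants from the example above and is nowhere near a complete number-to-speech expander:

```python
# Toy expander: covers only the "nineteen hundred" / "one thousand nine
# hundred" style variants from the example; everything else returns empty.
ONES = ["", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
TEEN_WORDS = {19: "nineteen"}  # abbreviated: only the case we need

def spoken_variants(n):
    """Return a set of ways a trader might say the number n (toy coverage)."""
    variants = set()
    # "nineteen hundred" style for round hundreds between 1,000 and 10,000
    if n % 100 == 0 and 1000 < n < 10000 and (n // 100) in TEEN_WORDS:
        variants.add(f"{TEEN_WORDS[n // 100]} hundred")
    # "one thousand nine hundred" style
    thousands, rest = divmod(n, 1000)
    if 0 < thousands < 10 and 0 < rest and rest % 100 == 0:
        variants.add(f"{ONES[thousands]} thousand {ONES[rest // 100]} hundred")
    return variants
```

Each variant then becomes its own search expression, so the single written figure “1,900” casts the wider net it needs in spoken content.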
During the course of the aforementioned meeting, one of the review team leaders finally came up with the suggestion I had been hinting at all along: instead of spending a lot of time THINKING about what the search terms should be, the better approach is simply to start searching with a few of the most realistic and highly probable terms that will bring up responsive files. Then start listening to those files, getting a better understanding of the language used and which terms will be the most relevant for searching.
You don’t have to listen to hundreds of hours to do this. In my experience, listening to just one hour of different calls for each major custodian will give you a great idea of the best terms to use. Develop the term list from there, do some searching, and listen to some more. You may come up with another set of terms that you can then add to your search criteria and iterate through again. This iterative process is what will help you round out your search term list and be confident in the results.
Many audio discovery projects are fairly straightforward. You may have a few hundred or a thousand hours of recorded calls or voicemails and you just need to load them all up and search through them interactively to find what’s relevant, responsive, privileged, or otherwise noteworthy.
But we see some audio projects that start out at a GARGANTUAN stage. Maybe it’s a regulatory request, or an overly zealous opposing counsel that asks for every recording ever made. Whatever the reason, we sometimes see projects that start in the tens of thousands of hours, and even a few that looked to be over 100,000 hours. While it would certainly be lucrative for us to process and host all that audio, the fact is, even with mondo-discounting the price would still rise to a level that some might consider “unduly burdensome.”
Enter stage right: Data culling in audio discovery is here to save the day!
Two years ago the buzzword was Early Case Assessment. Now it’s shifted to Predictive Coding or Technology Assisted Review. Whatever you call it, the process is essentially about using a rules-based approach to screen through content and identify the files that are most likely to be on-target. And this can work as well with audio discovery as it does with its text-based relatives.
In our experience, culling data for audio discovery takes two forms:
- Expression-based
- Voice-activity-based
Expression-based culling is much the same as what you do with textual documents. You simply identify the phrases or concepts that are likely to point to content of interest, and you run these against the full file set to identify your targets. If you’ve read my earlier posts, you know about the differences between searching text and audio, specifically as it relates to precision vs. recall. One of the big challenges with expression-based culling is to identify the optimal thresholds that will maximize the precision vs. recall trade-off. After all, with the culling process you are pulling files out of the mix that won’t be easily available for further review, so you need to be careful. Make sure you or your vendor is using statistically valid methods to test your culling criteria.
Voice-activity-based culling is unique to audio discovery, but can also be a big time and money saver. The need for this shows up quite often in trading floor investigations, especially when recordings are made from the open-mic or “squawk box” systems that are still in use. These systems can lead to hours and hours of silence, where the logger dutifully keeps making a recording even during off hours when no one is around. So being able to screen for the presence of voice, or for a certain percentage of voice during a recording, is critical to screening these calls out and avoiding further processing and hosting charges on them.
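A bare-bones version of that screen can be sketched with simple frame energy. Real voice-activity detectors are far more sophisticated; the frame size and thresholds below are arbitrary illustration values:

```python
# Minimal energy-based sketch of voice-activity screening. Frame size and
# amplitude threshold are arbitrary (threshold assumes 16-bit-style samples).

def voiced_fraction(samples, frame_size=400, threshold=500.0):
    """Fraction of frames whose mean absolute amplitude exceeds threshold."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    loud = sum(1 for f in frames
               if sum(abs(s) for s in f) / len(f) > threshold)
    return loud / len(frames)

def should_cull(samples, min_voiced=0.05):
    """Cull recordings that are almost entirely silence (under 5% voiced)."""
    return voiced_fraction(samples) < min_voiced
```

An hours-long squawk-box recording of dead air fails the 5% test and drops out of the set before anyone pays to process or host it.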
Employing these methods, we have seen projects reduced by 50-90% in terms of total hours that ultimately go into deeper review, which saves time and money for everyone involved. Score another one for the value of technology assisted review!
Two excellent reports have come out in the last year or so that address a pair of related issues: the growing costs of e-discovery, and the use of technology assisted review to help curtail those costs. While neither one addresses audio discovery specifically, the general thesis still applies: technology really can help you do things better, and cheaper. Who doesn’t like better and cheaper?
Well, there is actually an answer to that question which I’ll get back to in a minute. But first, a bit more detail on the two reports I mentioned.
The first is an article from the Richmond Journal of Law and Technology by Maura Grossman and Gordon Cormack. The link will download the entire article for you so I’ll spare you the legal citations, and in a short blog entry I have nowhere near the time to cover all the points. But I quote the first two sentences in the Conclusion of the report:
Overall, the myth that exhaustive manual review is the most effective – and therefore, the most defensible – approach to document review is strongly refuted. Technology-assisted review can (and does) yield more accurate results than exhaustive manual review, with much lower effort.
Why does manual review fare so poorly in this competition? Lots of reasons, but a big piece of it is reviewer fatigue, and also that reviewers make mistakes and often don’t agree on the significance of what they’ve read. Shocking! Not everyone thinks alike. Go figure.
The second report, from the RAND Institute for Civil Justice and titled “Where the Money Goes,” looks at the cost elements involved in discovery. Again, the link is there for you to download a summary or the whole report, so I want to key on just one element. When looking at the costs of producing electronic documents, their finding was that 73% of the cost came during the Review component of the EDRM. What does that really mean?
It means that no matter how much people gripe about charges from the e-discovery vendors, it’s still all those in-house and outside attorneys, paralegals and other folks who are eyeballing the documents that drive the total cost in the process. And as with the Grossman article, the Rand report provides evidence that technology can help make the whole process better, and cheaper.
How does this apply to audio discovery? For years, if any party presented or requested large bodies of audio evidence for discovery, the expected process for managing this discovery was human review. And it generally takes about 4 hours of human time to review each hour of audio. So if even a bargain-basement contract attorney makes $75/hour, that’s $300 per hour of audio in review costs. Even a fairly small 1,000-hour project would create a $300,000 cost, and most of the time the parties would just cry “unduly burdensome” and sweep it under the rug.
Fast forward to today, and audio discovery technology exists that has been proven effective in federal regulatory investigations, criminal cases and other litigation matters. It can lower costs by as much as 80%, in much the same way that technology assisted review lowers other e-discovery costs. And yet we see an interesting phenomenon: many law firms still espouse the use of manual review to run these audio projects. Who wouldn’t want something better and cheaper?
People often ask me who my competition is in the audio discovery arena. And while there are a few other technology providers in this space, my answer to this question is actually different. My biggest competition is…wait for it…the billable hour. Law firms make profit on billable hours. They don’t make profit on e-discovery costs (generally speaking).
I realize this is a bold and harsh statement, and I wouldn’t make it so blatantly except 1) I have heard from actual law firms who confirmed it for me, and 2) I’m not sure how many people are reading this blog yet, so I could use some publicity!
But seriously, if you have an opinion on this, weigh in here. Comments are welcome!
In my last post, we looked at accuracy as a necessary trade-off between precision and recall. Then we explored some of the variables that exist in audio discovery that make it quite different from text discovery. This leads us to the next important issue in determining the level of accuracy you can achieve with audio search.
What audio search methodology are you using?
We’ll consider the two approaches that involve computer technology here, and ignore the tried and true “human listening”—which, by the way, could actually be the LEAST ACCURATE of all, but we’ll leave that topic for another day.
The technology of audio discovery is not unlike that of text search at its most basic level. Any search engine has to first create an index of the content, and then provide a means for users to search these indexes and pull back results. The true key to accuracy in audio search lies in how these indexes are created, and there are two fundamentally different approaches.
Most people are at least somewhat aware of Speech Recognition. If you’ve played around with Dragon Dictation or seen any type of automatic text generation tool, you’ve seen it. The official name for this technology is Large Vocabulary Continuous Speech Recognition, or LVCSR. Most commonly it’s known as speech-to-text.
There are some systems that will run this process against large bodies of audio content and produce text indexes that can then be searched just like any other electronic documents. “That’s great!” you may say. “I can search audio just like I would anything else.”
That would be true, if the technology allowed for a perfect translation of the spoken word into textual content. Unfortunately, even with more than 50 years of eggheads (including a team of Google-ites) working on the problem, the general state of the technology is such that, even in the best-case scenario, these text documents carry a 35% “word error rate,” meaning that 35% of the text is actually NOT what was being said. And that’s in high-quality broadcast content with very clear speakers. When you consider the normal content found in audio discovery, with floor traders using dynamic slang amidst a cacophony of background noises, the word error rate can quickly rise to 50% or higher.
Look at the title of this blog post again: Can You Wreck A Nice Beach? Sound familiar? Say it to yourself quickly a few times, and I think you’ll get it. This is an actual translation from a speech-to-text system, and it showcases the difficulty of creating an automated translation that faithfully represents the spoken content in the recording.
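For the curious, that “word error rate” is just a word-level edit distance divided by the length of the reference transcript. Here's a small self-contained sketch that scores the title of this post against what was actually said:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1] / len(ref)

print(word_error_rate("can you recognize speech",
                      "can you wreck a nice beach"))  # → 1.0
```

Every word after “can you” needs fixing, so despite sounding nearly identical, the transcript scores a 100% word error rate on the phrase that matters.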
The other approach uses something called “phonetic indexing and search.” To understand how this works, you need to know what a phoneme (fo’-neem) is. Phonemes are the smallest parts of speech, the individual sounds that we string together to make words, phrases and sometimes embarrassing speeches!
In a phonetic indexing system, the software analyzes the audio and, instead of laying down text, it actually creates a time-aligned, phonetic representation of the content. It is capturing all the discrete spoken sounds that are used, and here’s the key—it’s not throwing anything out! Unlike a speech-to-text system, which makes bets along the way as to what words are being spoken (and loses that bet quite often), a phonetic index has captured ALL the original content and made it available for search.
The second part of the system then provides a standard user interface with which legal reviewers can search these phonetic indexes just like they would search any other type of content. Reviewers can enter search criteria just like they’re normally spelled, and use BOOLEAN and time-based proximity searches to create structured queries and get the most relevant results. And a highly evolved phonetic searching system will even give users the ability to make their own decisions about precision vs. recall; in the legal market, this typically means favoring recall in order to find even the most challenging results.
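To make the two-step idea concrete, here is a toy sketch: a time-aligned phoneme stream standing in for the index, and a search that converts a query's phonemes into subsequence matching. The phoneme labels and timestamps are invented for the demo, and a real engine scores approximate matches (that's where the precision vs. recall dial lives) rather than requiring exact ones:

```python
# Invented time-aligned phonetic index: (time in seconds, phoneme heard).
index = [
    (0.0, "K"), (0.1, "AE"), (0.2, "N"), (0.35, "Y"), (0.45, "UW"),
    (0.6, "R"), (0.7, "EH"), (0.8, "K"), (0.9, "AH"), (1.0, "N"),
    (1.1, "AY"), (1.2, "Z"), (1.35, "S"), (1.45, "P"), (1.55, "IY"),
    (1.65, "CH"),
]

def search(query_phonemes):
    """Return start times where the query's phoneme sequence occurs."""
    stream = [p for _, p in index]
    n = len(query_phonemes)
    return [index[i][0] for i in range(len(stream) - n + 1)
            if stream[i:i + n] == query_phonemes]

print(search(["S", "P", "IY", "CH"]))  # → [1.35]
```

Because the index kept every sound, the word is findable even though no text transcript was ever committed to; the query is simply turned back into sounds at search time.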
In the short space of two blog entries, it’s impossible to cover ALL the relevant details around this topic of accuracy in audio search. For example, some might notice a bias in this entry toward the phonetic indexing approach. Guilty as charged! But that’s why we allow comments, so I welcome other people’s thoughts on this topic…post ‘em if you’ve got ‘em!