Many audio discovery projects are fairly straightforward. You may have a few hundred or a thousand hours of recorded calls or voicemails and you just need to load them all up and search through them interactively to find what’s relevant, responsive, privileged, or otherwise noteworthy.
But we see some audio projects that start out at a GARGANTUAN scale. Maybe it’s a regulatory request, or an overly zealous opposing counsel who asks for every recording ever made. Whatever the reason, we sometimes see projects that start in the tens of thousands of hours, and even a few that looked to be over 100,000 hours. While it would certainly be lucrative for us to process and host all that audio, the fact is that even with mondo-discounting, the price would still rise to a level that some might consider “unduly burdensome.”
Enter stage right: Data culling in audio discovery is here to save the day!
Two years ago the buzzword was Early Case Assessment. Now it’s shifted to Predictive Coding or Technology Assisted Review. Whatever you call it, the process is essentially about using a rules-based approach to screen through content and identify the files that are most likely to be on-target. And this can work as well with audio discovery as it does with its text-based relative.
In our experience, culling data for audio discovery takes two forms:
- Expression-based
- Voice-activity-based
Expression-based culling is much the same as what you do with textual documents. You simply identify the phrases or concepts that are likely to point to content of interest, and you run these against the full file set to identify your targets. If you’ve read my earlier posts, you know about the differences between searching text and audio, specifically as they relate to precision vs. recall. One of the big challenges with expression-based culling is identifying the thresholds that strike the best balance between precision and recall. After all, the culling process pulls files out of the mix, and those files won’t be easily available for further review, so you need to be careful. Make sure you or your vendor is using statistically valid methods to test your culling criteria.
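To make the threshold question concrete, here is a minimal sketch in Python. It assumes you have a reviewed random sample in which each file carries a search confidence score and a reviewer's relevance call. The names `ScoredFile`, `precision_recall`, and `pick_threshold` are hypothetical and exist only for this illustration; they are not part of any particular audio search product.

```python
from dataclasses import dataclass


@dataclass
class ScoredFile:
    score: float    # assumed search-hit confidence, higher means a stronger match
    relevant: bool  # reviewer's judgment on this sampled file


def precision_recall(sample, threshold):
    """Precision and recall if we keep only files scoring at or above threshold."""
    kept = [f for f in sample if f.score >= threshold]
    true_pos = sum(f.relevant for f in kept)
    total_relevant = sum(f.relevant for f in sample)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_relevant if total_relevant else 1.0
    return precision, recall


def pick_threshold(sample, min_recall):
    """Highest observed score whose recall on the sample still meets min_recall."""
    best = None
    for t in sorted({f.score for f in sample}):
        _, recall = precision_recall(sample, t)
        if recall >= min_recall:
            best = t  # recall only falls as t rises, so the last hit is the highest
    return best
```

Raising the threshold culls more files but risks dropping relevant ones, so the sketch fixes a recall floor first and then culls as aggressively as that floor allows. The numbers it produces are only as trustworthy as the sample behind them, which is why the sample should be random and large enough to support the conclusion.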
Voice-activity-based culling is unique to audio discovery, and it can be a big time and money saver. The need for it shows up quite often in trading floor investigations, especially when recordings come from the open-mic or “squawk box” systems that are still in use. These systems can produce hours and hours of silence, because the logger dutifully keeps recording even during off hours when no one is around. So being able to screen for the presence of voice, or for a minimum percentage of voice in a recording, is critical to screening these recordings out and avoiding further processing and hosting charges on them.
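As a rough illustration of the "percentage of voice" idea, the sketch below measures how much of a recording rises above a loudness floor. It is a deliberately simple energy check, not a real voice activity detector: it assumes the audio is already decoded into a list of integer samples, and the frame length, RMS threshold, and 5 percent cutoff are made-up values chosen for the example. The function names are hypothetical as well.

```python
import math


def voice_ratio(samples, frame_len=160, rms_threshold=500.0):
    """Fraction of full frames whose RMS level reaches rms_threshold.

    Energy is only a stand-in for speech here; any loud noise will also count.
    """
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    if not frames:
        return 0.0
    active = 0
    for frame in frames:
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= rms_threshold:
            active += 1
    return active / len(frames)


def should_cull(ratio, min_ratio=0.05):
    """Cull a recording when its active fraction falls below min_ratio."""
    return ratio < min_ratio
```

A recording of pure silence scores 0.0 and gets culled, while one with even a short burst of activity can be kept for deeper processing. In practice a purpose-built voice detector would replace the energy check, since a squawk box can pick up plenty of sound that is not speech.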
Employing these methods, we have seen projects reduced by 50-90% in terms of the total hours that ultimately go into deeper review, which saves time and money for everyone involved. Score another one for the value of technology-assisted review!