
Real-Time Predictive Coding

Clustify provides a full range of technology-assisted review (TAR) capabilities, including real-time predictive coding, conceptual clustering, near-duplicate detection, and content-based email threading, to make document review faster, more consistent, and less expensive while surfacing important documents earlier. It is a powerful tool that is also easy to use.

Predictive coding software learns to tag or categorize documents by analyzing training documents categorized by a human reviewer. Once testing establishes that the algorithm does a good job of separating relevant documents from non-relevant ones, human review can be focused on just the documents predicted to be relevant, drastically reducing the time and expense of review. A more aggressive approach is to produce documents the software predicts to be relevant without human review, unless a document is suspected to be privileged. Less aggressive uses include prioritizing documents even when all of them will be reviewed by humans, or testing for inconsistencies in the human review.
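
A minimal sketch of that general idea (not Clustify's proprietary algorithm), using an off-the-shelf scikit-learn classifier and invented document texts: fit a model to human-tagged training documents, then produce a relevance score for every untagged document in the population.

    # Minimal sketch of the predictive coding idea, not Clustify's
    # proprietary algorithm. Documents and tags are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = ["contract amendment draft", "office party schedule"]
    train_tags = [1, 0]            # 1 = relevant, per a human reviewer
    population_texts = ["signed contract rider", "cafeteria menu"]

    vectorizer = TfidfVectorizer()
    model = LogisticRegression().fit(vectorizer.fit_transform(train_texts),
                                     train_tags)

    # A relevance score (estimated probability of relevance) per document.
    scores = model.predict_proba(vectorizer.transform(population_texts))[:, 1]
    for text, score in sorted(zip(population_texts, scores),
                              key=lambda pair: -pair[1]):
        print(f"{score:.2f}  {text}")

Here are some things you should know about Clustify's predictive coding capability: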

  • Real-Time - Clustify is the first tool with real-time predictive coding, meaning that it updates its predictions for the entire document population each time you review a training document and shows the impact instantly.* Training the system is like a continuous one-on-one exchange between student and teacher, rather than lecturing for a semester and finding out from the final exam whether the student learned anything.
  • Consistency Checking - If Clustify thinks you've mis-tagged a document, it displays a warning icon adjacent to it. This happens in real time, so you are alerted while the document is still fresh in your mind, not later when review of a full "batch" of documents is completed.
  • Relevance Score - Clustify produces a relevance score for each document, not just a yes/no relevance prediction. Documents with a higher score have a higher probability of being relevant according to the algorithm. As you work your way down the list of documents sorted by relevance score, each additional relevant document you find will require wading through more and more non-relevant documents. You can choose your stopping point based on human review cost and proportionality.
  • Progress Pie - Clustify shows the confidence of its predictions in the progress pie.
  • Control Sets - You have the option to create a control set so you can see how the predictions compare to human review results while the system is being trained. The precision-recall curve for the control set shows the evolution of prediction quality in real time (a worked precision-recall example appears after this list).
  • Random vs. Judgmental - When relevant documents are rare, it can be hard to find enough relevant training documents with random sampling, so preferentially sampling documents that are more likely to be relevant, commonly known as judgmental sampling, may be necessary. Clustify allows you to use any combination of random and judgmental sampling.
  • Active Learning - Clustify offers several different algorithms for selecting documents to use for training that are most likely to improve its understanding of the document population.
  • Smart Sampling - This is a special active learning algorithm that improves on the continuous active learning (CAL) approach. CAL has been shown to be highly efficient when prevalence is low and the classification problem is difficult. Like CAL, smart sampling selects documents that are highly likely to be relevant, but it also aims for diversity and a balanced training set that produces good predictions even when the learning process is terminated early, allowing greater workflow flexibility (a generic CAL sketch appears after this list).
  • Flexible Workflow - Clustify does not force a particular workflow on the user. You can review whatever documents you want and see results immediately, without waiting for batches to be completed. The workflow can be as simple as reviewing documents until the progress pie is substantially filled in, then iteratively reviewing the highest-scoring documents while also using them as training documents (CAL), and proceeding to testing when truly relevant documents become sparse among the highest-scoring remainder.
  • Keyword Labels - Each document is labeled with a few important keywords to aid navigation and assessment of the documents.
  • Shadow Tags - If you have a set of documents that have already been tagged, Clustify makes it easy to run experiments to see how quickly the algorithm learns and how different training strategies perform. Simply load the tags as "shadow tags" and they will be applied to the documents automatically whenever a new training or control set is generated, so you can simulate an entire document review, with many iterations of training and testing, in a matter of minutes and watch the results evolve (the CAL sketch after this list uses the same idea to simulate the reviewer).
  • Email Headers and Footers - You can tell Clustify to identify and ignore email headers and footers for cleaner, more accurate results. You don't want your predictive coding software categorizing documents based on bloated confidentiality footers that dwarf the body text (a rough illustration follows this list).
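
To make the Relevance Score and Control Sets items above concrete, here is a small worked example of the standard precision/recall arithmetic behind a precision-recall curve. The scores and tags are invented for illustration; they are not Clustify output.

    # Hypothetical control set: (relevance score, human tag) pairs,
    # where tag 1 means the reviewer marked the document relevant.
    control = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.40, 0), (0.20, 0)]
    total_relevant = sum(tag for _, tag in control)

    # Walk down the score-sorted list and report precision and recall
    # at each cutoff -- the points a precision-recall curve plots.
    found = 0
    for reviewed, (score, tag) in enumerate(sorted(control, reverse=True), 1):
        found += tag
        print(f"cutoff {score:.2f}: precision {found / reviewed:.2f}, "
              f"recall {found / total_relevant:.2f}")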
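
The continuous active learning loop mentioned under Smart Sampling, and the simulation idea behind Shadow Tags, can be sketched together: pre-existing tags stand in for the human reviewer while a generic CAL loop repeatedly trains, scores the population, and "reviews" the highest-scoring unreviewed document. This is an invented illustration, not Clustify's smart sampling algorithm, which also aims for diversity and balance.

    # Generic CAL loop with pre-existing "shadow" tags standing in for
    # the reviewer. Documents, tags, and seed set are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["merger agreement", "lunch menu", "merger side letter",
            "parking memo", "agreement exhibit", "holiday card"]
    shadow_tags = [1, 0, 1, 0, 1, 0]   # already-known tags (1 = relevant)
    reviewed = {0, 1}                  # seed: one relevant, one non-relevant

    X = TfidfVectorizer().fit_transform(docs)
    for round_no in range(2):          # a couple of CAL iterations
        ids = sorted(reviewed)
        model = LogisticRegression().fit(X[ids], [shadow_tags[i] for i in ids])
        scores = model.predict_proba(X)[:, 1]
        # "Review" the highest-scoring unreviewed document next.
        pick = max((i for i in range(len(docs)) if i not in reviewed),
                   key=lambda i: scores[i])
        print(f"round {round_no}: review doc {pick} ({docs[pick]!r})")
        reviewed.add(pick)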
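
And a rough illustration of the Email Headers and Footers point: a hypothetical regular expression that drops a boilerplate confidentiality footer before a document is scored. Clustify's built-in header and footer detection is more sophisticated than this sketch.

    import re

    # Hypothetical pattern for a common confidentiality footer.
    FOOTER = re.compile(r"This e-?mail .{0,200}confidential.*$",
                        re.IGNORECASE | re.DOTALL)

    body = ("Please review the attached draft.\n"
            "This email and any attachments are confidential and may be "
            "privileged. If you are not the intended recipient, notify the "
            "sender and delete all copies.")
    print(FOOTER.sub("", body).strip())   # -> Please review the attached draft.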

Click here for a video showing a few of Clustify's predictive coding capabilities.

To learn more about how Clustify can improve the e-discovery process, fill out the form below, or contact us.


*In a test on 1.3 million documents containing 3.3 gigabytes of text, run on a modest desktop computer, it took Clustify an average of a tenth of a second to update the relevance scores for the full population each time a training document was tagged. The time required depends on the data set.