Any collection of electronic and scanned documents is likely to contain a significant amount of duplicate information. Often, multiple copies of the same document are attached to a variety of emails. Alternatively, both an electronic document and a scanned image of its paper copy may be contained in the same collection.
While exact duplicate electronic documents are easily matched using MD5 hash codes, similar documents (or near-duplicates) are not absolutely identical and are consequently much more difficult to identify. Using technology to group similar documents and code them in the same manner, rather than reviewing and coding them individually, can save significant document review time and ultimately reduce review costs.
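To illustrate why exact duplicates are the easy case, here is a minimal Python sketch that groups files by the MD5 hash of their bytes. The file names and contents are hypothetical, and the helper `group_exact_duplicates` is an illustration only, not part of any product mentioned here. Note that changing a single character produces a different hash, which is why this approach cannot find near-duplicates.

```python
import hashlib
from collections import defaultdict


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[md5_of_bytes(data)].append(name)
    # Keep only hashes shared by two or more files.
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


# Hypothetical sample data: two identical files and one with a single changed character.
sample = {
    "contract_v1.doc": b"This agreement is made on 1 May.",
    "contract_copy.doc": b"This agreement is made on 1 May.",
    "contract_v2.doc": b"This agreement is made on 2 May.",
}
exact_groups = group_exact_duplicates(sample)
print(exact_groups)
```

Here `contract_v2.doc` differs by one character, so it hashes differently and is left out of the exact-duplicate group even though a reviewer would consider it nearly the same document.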
There are three basic types of near-duplicate files:
First are matching documents which contain the same content but are stored in different formats, such as a Microsoft Word document and an Adobe PDF created from that document.
A second type includes multiple documents created from the same template; these documents are identical except for the varied information entered in the template fields, such as names and dates in the case of contract templates.
The third type of near-duplicate documents consists of those which contain overlapping information. Email collections containing messages that have been commented on and forwarded fall into this category, since the forwarded message will contain the original email plus additional information.
The use of eDiscovery technology to identify and group these similar documents can have a dramatic impact on the time and effort required to review a document set. By way of example, we at H&AeDiscovery recently assisted in a matter involving a single plaintiff and multiple defendants from the same organization. Each defendant was represented by different counsel, each of whom undertook to independently compile their affidavit of documents. The documents consisted of both Electronically Stored Information (ESI) and paper. All of the documents were provided to us as TIFF images and load files of varying formats. Given that all the documents were from individuals within the same organization, it was anticipated that many documents would be duplicated between production sets. We were retained to consolidate and de-duplicate the productions into a single production to be imported into CT Summation.
We used the similar document identification utility of eExamine, powered by Equivio, to identify near-duplicates within the entire population of documents, using the OCR text from each document image. The threshold for the definition of a similar document was set such that the identified documents would, for all intents and purposes, be considered identical.
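The role of a similarity threshold can be sketched in Python. This is not Equivio's algorithm, which is not described here; it is a generic stand-in that compares overlapping three-word sequences (shingles) with the Jaccard index and places each document into the first group whose pivot document it matches at or above the threshold. The function names, the 0.9 threshold, and the sample OCR text are all assumptions made for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences in the text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_duplicates(docs: dict, threshold: float = 0.9) -> list:
    """Greedy grouping: the first document in each group acts as its pivot."""
    sets = {name: shingles(text) for name, text in docs.items()}
    groups = []
    for name in docs:
        for group in groups:
            if jaccard(sets[name], sets[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            # No existing pivot was similar enough, so start a new group.
            groups.append([name])
    return groups


# Hypothetical OCR text from three scanned pages.
ocr_text = {
    "scan_a": "Please find attached the signed lease for the Toronto office.",
    "scan_b": "PLEASE FIND ATTACHED the signed lease, for the Toronto office",
    "scan_c": "Minutes of the quarterly budget meeting held in Ottawa.",
}
near_dup_groups = group_near_duplicates(ocr_text)
print(near_dup_groups)
```

A high threshold keeps only documents that are effectively identical in each group, which is the setting described above; lowering it would pull in looser matches such as template variants, at the cost of groups that may no longer be safe to bulk-code.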
The results of this process were:
| Measure | Count |
| --- | --- |
| Total documents | 38,673 |
| Unique documents without any “Near-Duplicates” | 23,812 |
| Documents with one or more “Near-Duplicates” | 14,861 |
| Number of “Near-Duplicate” groups | 3,588 |
| Number of documents potentially removed from the review process | 11,273 |
| Number of documents in the smallest “Near-Duplicate” group | 2 |
| Number of documents in the largest “Near-Duplicate” group | 129 |
The resulting population of 27,400 unique documents consisted of:
• 23,812 unique documents without any near-duplicates
• 3,588 unique documents with one or more near-duplicates
eExamine's similar document utility enabled us to identify duplicates both within and between the production sets, and reduced the document population to be reviewed by approximately 29%.
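The figures reported above can be cross-checked with a few lines of Python. The variable names are ours; the numbers are taken directly from the results. The sketch assumes that one representative document is kept for review from each near-duplicate group.

```python
# Figures taken from the results reported in this article.
total_documents = 38_673
unique_without_near_dups = 23_812
docs_with_near_dups = 14_861
near_dup_groups = 3_588

# Assumption: one representative per group stays in the review set.
removable = docs_with_near_dups - near_dup_groups
remaining = unique_without_near_dups + near_dup_groups
reduction_pct = 100 * removable / total_documents

print(f"Removable documents: {removable:,}")
print(f"Remaining to review: {remaining:,}")
print(f"Reduction: {reduction_pct:.1f}%")
```

The removable count matches the 11,273 documents reported, the remaining population matches the 27,400 unique documents, and the reduction works out to roughly 29%.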
In this matter, almost one-third of the documents remaining after the initial production sets had been de-duplicated (exact duplicates removed) fell into the category of similar documents or near-duplicates. Grouping them together within eExamine allowed these document groups to be bulk-coded, resulting in tremendous savings in review time.