An Overview of Electronic Discovery Processing
By Rob Robinson
What is “Processing”?
In the realm of electronic discovery, “Processing” is any operation or set of operations which is performed upon data, whether or not by automatic means, such as collection, recording, organization, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure or destruction. [i]
Why is “Processing” important?
While there are many ways to define, describe, and organize the tasks that take place in electronic discovery processing, for the purpose of this discussion we will focus on the following nine major tasks and how they interrelate to accomplish electronic discovery processing:
- Chain of Custody Security and Tracking
- Data Staging
- Data Filtering
- Deduplication
- Metadata Extraction
- Full Text Extraction
- Exception Handling
- Data Conversion
- Load File Production
Chain of Custody Security and Tracking
Defined by The Sedona Conference as “the documentation and testimony regarding the possession, movement, handling and location of evidence from the time it is obtained to the time it is presented in court; used to prove that evidence has not been altered or tampered with in any way; necessary both to assure admissibility and probative value”, Chain of Custody is the part of electronic discovery processing that ensures the evidence is authentic.
Documenting and tracking the physical media that contains electronically stored information (ESI) throughout the entire electronic discovery process can help organizations ensure their evidence is viewed as authentic. Additionally, just as physical media containing electronic documents must be treated as evidence, the same rule holds true for each individual file.
Automating technical chain of custody activities can help substantiate the exact process that files go through prior to admission in a case. The benefits of automation can be even greater when the case consists of millions of files, because automation ensures each file goes through the exact same process.[iii]
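As an illustration of what automated chain of custody tracking might look like, the following minimal Python sketch records a hash, handler, action, and timestamp each time a file is handled, then checks that the hash never changed. The field names, the `custody_entry` and `is_unaltered` helpers, and the choice of SHA-256 are assumptions made for this example and are not taken from any particular e-discovery product.

```python
import hashlib
import json
from datetime import datetime, timezone


def custody_entry(file_id: str, data: bytes, action: str, handler: str) -> dict:
    """Build one custody log record (field names are hypothetical)."""
    return {
        "file_id": file_id,
        "sha256": hashlib.sha256(data).hexdigest(),
        "action": action,
        "handler": handler,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def is_unaltered(log: list) -> bool:
    """Return True when every entry for a given file records the same digest."""
    first_seen = {}
    for entry in log:
        expected = first_seen.setdefault(entry["file_id"], entry["sha256"])
        if expected != entry["sha256"]:
            return False
    return True


# Example: the same file is logged at collection and again at staging.
log = [
    custody_entry("DOC-0001", b"quarterly report", "collected", "examiner_a"),
    custody_entry("DOC-0001", b"quarterly report", "staged", "examiner_b"),
]
print(json.dumps(log[0], indent=2))
print(is_unaltered(log))
```

Because every file passes through the same function, the log itself becomes a record that each of potentially millions of files went through the same process.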
Data Staging
Data Staging is the process by which original ESI files are copied, isolated, and stored in a forensically sound manner for future use.
This staging typically occurs in three phases:
- Copying and storage of original ESI files on a closed and isolated network file server.
- Storage of original media and ESI files in a forensically sound manner.
- Storage of copied ESI files for use in further electronic discovery processing.
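The copy step of staging can be sketched in Python as follows. This is a simplified illustration only: it copies a file, verifies the copy against the original by hash, and marks the working copy read-only. Real forensic staging typically involves additional controls (such as write blockers and isolated networks) that are outside the scope of this sketch, and the `stage_file` helper and directory names are hypothetical.

```python
import hashlib
import os
import shutil
import stat
import tempfile


def file_sha256(path: str) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_file(source: str, staging_dir: str) -> str:
    """Copy source into staging_dir, verify the copy, and mark it read-only."""
    os.makedirs(staging_dir, exist_ok=True)
    target = os.path.join(staging_dir, os.path.basename(source))
    shutil.copy2(source, target)  # copy2 also attempts to preserve timestamps
    if file_sha256(source) != file_sha256(target):
        raise IOError(f"hash mismatch while staging {source}")
    os.chmod(target, stat.S_IREAD)  # working copy should not be modified
    return target


# Example run inside a throwaway directory.
with tempfile.TemporaryDirectory() as workspace:
    original = os.path.join(workspace, "memo.txt")
    with open(original, "w") as handle:
        handle.write("original ESI content")
    staged = stage_file(original, os.path.join(workspace, "staging"))
    print(file_sha256(original) == file_sha256(staged))
    os.chmod(staged, stat.S_IREAD | stat.S_IWRITE)  # permit cleanup on Windows
```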
Data Filtering
Data Filtering is the process of identifying specific data for extraction based on specific parameters. Filtering can occur at many different levels, including:
• System File Filtering: This type of filtering is designed to exclude those files known as system files from the filtering results data set.
• Date Range Filtering: This type of filtering is designed to either include or exclude files dated within prescribed date and time ranges.
• Extension Filtering: This type of filtering is designed to either include or exclude specific files based on their extension and typically includes file type validation.
• Custodian Filtering: This type of filtering is designed to either include or exclude specific custodians from the filtering results data set.
• Keyword Filtering: This type of filtering is commonly referred to as “keyword search” and is designed to filter data by prescribed keywords and/or keyword-driven concepts.
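The filter levels listed above can be combined into a single pass over the file inventory. The Python sketch below is one possible arrangement under simplifying assumptions: the `FileRecord` fields are hypothetical, system files are excluded only implicitly through the extension list, and keyword matching is a plain case-insensitive substring test rather than a concept search.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileRecord:
    name: str
    custodian: str
    modified: datetime
    text: str


def passes_filters(record, extensions, start, end, custodians, keywords):
    """Apply extension, date range, custodian, and keyword filters in turn."""
    extension = record.name.rsplit(".", 1)[-1].lower() if "." in record.name else ""
    if extension not in extensions:
        return False
    if not (start <= record.modified <= end):
        return False
    if record.custodian not in custodians:
        return False
    lowered = record.text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


records = [
    FileRecord("budget.xlsx", "alice", datetime(2007, 5, 1), "Q2 budget forecast"),
    FileRecord("notes.docx", "bob", datetime(2007, 6, 15), "merger timeline draft"),
    FileRecord("kernel32.dll", "alice", datetime(2007, 6, 1), ""),
    FileRecord("memo.docx", "alice", datetime(2005, 1, 1), "merger memo"),
]

kept = [
    record.name
    for record in records
    if passes_filters(
        record,
        extensions={"docx", "xlsx"},
        start=datetime(2007, 1, 1),
        end=datetime(2007, 12, 31),
        custodians={"alice", "bob"},
        keywords={"merger"},
    )
]
print(kept)  # ['notes.docx']
```

Ordering the cheap checks (extension, date, custodian) before the keyword test keeps the more expensive text comparison off files that would be excluded anyway.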
Deduplication
Deduplication is the process of identifying and segregating those files that are exact duplicates of one another. The goal is to provide a deliverable that contains one copy of each original document, while maintaining the information associated with each instance of that document within the collection.
Several ways duplicates can be identified are:
• A combination of metadata information can be compared to match files.
• An electronic fingerprint of each file can be taken and compared using a mathematical hashing algorithm such as MD5, SHA-1, or SHA-256.
• In some cases, a hashing algorithm is used in combination with metadata. [iv]
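Hash-based deduplication of the kind described above can be sketched in a few lines of Python. This example groups files by content hash so that one copy can be delivered while the path and custodian of every instance are retained. The tuple layout and the `deduplicate` helper are assumptions for illustration, and SHA-256 is used here only as one possible hashing algorithm.

```python
import hashlib
from collections import defaultdict


def deduplicate(files):
    """Group (path, custodian, content) tuples by the hash of their content.

    Returns a dict mapping each digest to the list of instances that share it.
    The first instance in each list can serve as the copy that is delivered.
    """
    groups = defaultdict(list)
    for path, custodian, content in files:
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append({"path": path, "custodian": custodian})
    return dict(groups)


files = [
    ("alice/inbox/contract.doc", "alice", b"contract text v1"),
    ("bob/sent/contract.doc", "bob", b"contract text v1"),
    ("bob/drafts/contract_v2.doc", "bob", b"contract text v2"),
]

groups = deduplicate(files)
for digest, instances in groups.items():
    unique_copy = instances[0]["path"]
    custodians = sorted({instance["custodian"] for instance in instances})
    print(unique_copy, len(instances), custodians)
```

Note that only bit-for-bit identical content shares a digest, so the second draft of the contract is kept as a separate document.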
In addition to deduplication, the advent of near-deduplication technologies allows for an even higher level of data reduction, because these technologies identify files that are materially similar but are not bit-level duplicates. Near-deduplication technologies help identify and group or tag electronic files that have “near duplicate” similarities yet differ somewhat in content, metadata, or both. Examples include document versions, emails sent to multiple custodians, different parts of email chains, or similar proposals sent to several clients.[v]
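The source does not name a specific near-duplicate technique. As one commonly used illustration, the Python sketch below compares documents by the Jaccard similarity of their overlapping three-word sequences ("shingles"). The shingle size, the 0.4 threshold, and the function names are all assumptions chosen for the example; commercial tools may work quite differently.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b, size=3):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(text_a, text_b, threshold=0.4):
    """Treat two texts as near duplicates when similarity meets the threshold."""
    return jaccard(text_a, text_b) >= threshold


proposal_acme = "please review the attached proposal for the acme account before friday"
proposal_globex = "please review the attached proposal for the globex account before friday"
unrelated = "lunch menu for the cafeteria is posted on the fourth floor"

print(jaccard(proposal_acme, proposal_globex))       # 0.5
print(is_near_duplicate(proposal_acme, proposal_globex))
print(is_near_duplicate(proposal_acme, unrelated))
```

The two proposals differ by a single client name, which is exactly the "similar proposals sent to several clients" case: they would never share a hash, but they score well above unrelated text.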
[i]
[ii] The Electronic Discovery Reference Model, March 2008
[iii] The Electronic Discovery Reference Model, March 2008
[iv] The Electronic Discovery Reference Model, March 2008
[v] The Sedona Conference Glossary, December 2007
[vi] The Marine Metadata Interoperability (MMI) Guide
[vii] The Electronic Discovery Reference Model, March 2008
[viii] Lexbe Glossary, www.lexbe.com, March 2008
[ix] Electronic Discovery Processing: What You Need To Know To Maximize Success In Winning Cases And Cutting Costs, Metropolitan Corporation Counsel, October 2007

