PST File (Outlook) Data Extraction Plan


I. Executive Summary

This section gives a high-level overview of the plan to extract, normalize, and structure data from Microsoft Outlook PST files. The sections that follow set out the objectives, methodology, and expected outcomes for migrating historical communications into a centralized, searchable database for compliance, eDiscovery, or archival needs. The focus throughout is on preserving data integrity by using specialized tools.

II. Project Scope & Objectives

This section defines the boundaries and goals of the PST extraction project. It lists which data and processes are included (In-Scope) and excluded (Out-of-Scope), and sets specific, measurable targets for completeness, integrity, and efficiency.

A. Scope (In-Scope and Out-of-Scope)

| Component | Status | Details |
|---|---|---|
| PST Files | IN | All identified PST files residing on the specified network path/local drives. |
| Extraction Targets | IN | Email messages (including headers, body, attachments), calendar entries, and contact data. |
| Output Format | IN | Export to a normalized, database-ready format (e.g., JSONL or SQL inserts). |
| Data Integrity | IN | Verification of hash values (MD5/SHA-256) before and after extraction. |
| System Decommission | OUT | Deletion or secure destruction of original source PST files post-extraction. |
| Native Application Access | OUT | Providing users with direct access to extracted data via an Outlook client. |

B. Objectives

  • **Completeness Target (> 99.5%):** Extract nearly all data objects (emails, contacts, calendar items).
  • **Integrity Goal (100%):** Maintain original metadata (timestamps, senders, attachments).
  • **Efficiency Target (≤ 30 days):** Complete the extraction, processing, and indexing pipeline rapidly.
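
The completeness target can be checked mechanically once item counts are available. The following is a minimal sketch, assuming the denominator is the raw item count from parsing and the numerator is the number of items successfully exported; the function names and that choice of counts are illustrative assumptions, not something the plan specifies.

```python
def completeness(raw_item_count: int, extracted_count: int) -> float:
    """Return extracted items as a percentage of the items found during parsing."""
    if raw_item_count <= 0:
        raise ValueError("raw_item_count must be positive")
    return 100.0 * extracted_count / raw_item_count


def meets_target(raw_item_count: int, extracted_count: int,
                 target_pct: float = 99.5) -> bool:
    # The objective is stated as "> 99.5%", so the comparison is strict.
    return completeness(raw_item_count, extracted_count) > target_pct
```

For example, 199,100 exported items out of 200,000 parsed gives 99.55% and passes, while 199,000 gives exactly 99.5% and does not.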

III. Data Extraction Methodology

This section details the step-by-step approach for extracting data: preparation, tool selection, and the core extraction process, which follows standard eDiscovery practice. Each phase of the extraction process is described below.

A. Preparation and Sourcing

  • **Inventory:** Create a definitive list of all source PST files (size, date modified).
  • **Hashing (Source):** Generate MD5/SHA-256 hashes pre-extraction for Chain of Custody.
  • **Environment:** Use a locked-down forensic or virtual environment.
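
The inventory and source-hashing steps above could be sketched as follows. This is an illustrative outline using only the Python standard library, not the tooling the plan mandates; the function names, the row layout, and the `*.pst` glob (which is case-sensitive on most Linux filesystems) are assumptions.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1024 * 1024) -> dict:
    """Compute MD5 and SHA-256 in a single pass, reading in chunks so a
    large PST file is never loaded into memory at once."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}


def build_inventory(root: Path) -> list[dict]:
    """List every PST file under root with size, modification time, and hashes."""
    rows = []
    for pst in sorted(root.rglob("*.pst")):
        stat = pst.stat()
        rows.append({
            "path": str(pst),
            "size_bytes": stat.st_size,
            "modified_utc": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc).isoformat(),
            **hash_file(pst),
        })
    return rows
```

Running the same `hash_file` over each source file after extraction and comparing the digests with the inventory gives the before-and-after verification listed under Data Integrity in the scope.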

B. Tool Selection

The extraction will use a licensed, non-consumer-grade application (e.g., forensic eDiscovery software, `pst-extractor`) capable of handling corrupt files, bulk processing, and detailed logging.

C. Extraction Process (E-Discovery Standard)

1. Deep Parsing → 2. Metadata Export → 3. Body & Attachments → 4. Normalization

  • **Phase 1: Deep Parsing.** The tool scans the PST structure, identifying all nested objects (messages, contacts, folders). Output: Raw Item Count Log.
  • **Phase 2: Metadata Export.** Extract all metadata fields (e.g., Subject, Date, From, To, CC, BCC). Output: CSV/JSON Metadata Files.
  • **Phase 3: Body & Attachment Export.** Extract the plain text and HTML body of each item, along with any attachments (placed in a structured folder hierarchy). Output: Structured Folders with Attachments.
  • **Phase 4: Normalization.** Convert output into a uniform data structure (JSONL), mapping Outlook-specific fields to standard database fields (e.g., mapping `PR_MESSAGE_DELIVERY_TIME` to `timestamp`). Output: Normalized JSONL File.
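
The field renaming in Phase 4 could look roughly like the sketch below. It assumes the Phase 2 metadata arrives as one Python dict per item, keyed by MAPI property name; that input shape and the helper names are assumptions for illustration, and the three mappings shown are the ones listed in Section IV.

```python
import json

# Outlook (MAPI) property name -> normalized output field name.
FIELD_MAP = {
    "PR_SENDER_EMAIL_ADDRESS": "senderEmail",
    "PR_MESSAGE_DELIVERY_TIME": "timestamp",
    "PR_SUBJECT": "subject",
}


def normalize_item(raw: dict) -> dict:
    """Rename Outlook-specific keys to output field names.
    Keys without a mapping are passed through unchanged."""
    return {FIELD_MAP.get(key, key): value for key, value in raw.items()}


def to_jsonl(items) -> str:
    """Serialize the normalized items as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(normalize_item(item), ensure_ascii=False) for item in items
    )
```

Writing one object per line keeps the output appendable and lets a loader process a large export item by item instead of parsing a single huge JSON document.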

IV. Data Structure and Normalization

This section outlines the target format for the extracted data: the standard output fields, their source mappings within the PST file, their data types, and notes on timestamp formatting and hashing for traceability.

| Output Field | Source Mapping | Data Type | Notes |
|---|---|---|---|
| `messageID` | PST Item ID / EntryID | String | Unique identifier for archival. |
| `senderEmail` | `PR_SENDER_EMAIL_ADDRESS` | String | Normalized to lowercase. |
| `timestamp` | `PR_MESSAGE_DELIVERY_TIME` | Datetime | UTC format MANDATORY (`YYYY-MM-DD HH:MM:SS Z`). |
| `subject` | `PR_SUBJECT` | String | |
| `bodyText` | Extracted plain text body | Text | Used for primary searching. |
| `attachmentCount` | Count of attachments found | Integer | |
| `attachmentPaths` | Array of paths to extracted attachment files | Array of Strings | Relative paths only. |
| `originalHash` | SHA-256 hash of the original PST file | String | Used for linking back to the source archive. |
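
A record in this format can be checked before it is loaded into the database. The validator below is a sketch under several assumptions that the plan does not state: every field is required, the trailing `Z` in the timestamp format is a literal UTC designator, and `originalHash` is a lowercase hexadecimal digest. The function name is hypothetical.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# Field name -> expected Python type after JSON decoding.
REQUIRED = {
    "messageID": str, "senderEmail": str, "timestamp": str, "subject": str,
    "bodyText": str, "attachmentCount": int, "attachmentPaths": list,
    "originalHash": str,
}
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} Z")
SHA256_RE = re.compile(r"[0-9a-f]{64}")


def validate_record(rec: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in REQUIRED.items():
        if name not in rec:
            problems.append(f"missing field: {name}")
        elif not isinstance(rec[name], expected):
            problems.append(f"wrong type for {name}")
    if problems:
        return problems  # later checks rely on the fields being present and typed
    if rec["senderEmail"] != rec["senderEmail"].lower():
        problems.append("senderEmail is not lowercase")
    if not TIMESTAMP_RE.fullmatch(rec["timestamp"]):
        problems.append("timestamp is not 'YYYY-MM-DD HH:MM:SS Z'")
    if rec["attachmentCount"] != len(rec["attachmentPaths"]):
        problems.append("attachmentCount does not match attachmentPaths")
    for path in rec["attachmentPaths"]:
        if PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute():
            problems.append(f"absolute attachment path: {path}")
    if not SHA256_RE.fullmatch(rec["originalHash"]):
        problems.append("originalHash is not a 64-character hex SHA-256 digest")
    return problems
```

Returning every problem at once, rather than raising on the first, makes it practical to log all defects for a record in a single pass over a large export.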

V. Timeline and Resource Allocation

This section visualizes the estimated project timeline, assuming 2TB of data across 100 PST files. The chart shows the duration for each key activity and the responsible team.
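
As a rough sanity check on these assumptions, the sustained throughput needed to move 2 TB inside the 30-day efficiency target can be worked out directly. The sketch below assumes decimal units (1 TB = 10^12 bytes) and that the whole 30-day window is available for processing; in practice the extraction activity gets only part of that window, so the real requirement is higher.

```python
TOTAL_BYTES = 2 * 10**12   # 2 TB, decimal units assumed
PST_FILES = 100
WINDOW_DAYS = 30           # efficiency target from Section II.B


def required_throughput(total_bytes: int, window_days: int) -> float:
    """Average bytes per second needed to process total_bytes in window_days."""
    seconds = window_days * 24 * 60 * 60
    return total_bytes / seconds


avg_file_gb = TOTAL_BYTES / PST_FILES / 10**9
mb_per_s = required_throughput(TOTAL_BYTES, WINDOW_DAYS) / 10**6

print(f"Average PST size: {avg_file_gb:.1f} GB")
print(f"Minimum sustained throughput: {mb_per_s:.2f} MB/s")
```

Under these assumptions the average file is 20 GB and the floor is about 0.77 MB/s, which suggests that raw I/O is unlikely to be the bottleneck compared with parsing, corruption recovery, and indexing.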

VI. Risks and Mitigation

This section identifies potential challenges during the extraction process, their likely impact, and the planned mitigation for each.

| Risk | Impact | Mitigation Strategy |
|---|---|---|
| File Corruption | Loss of entire PST file contents. | Use forensic tool with checkpointing and sector-level recovery capabilities. |
| Timezone Ambiguity | Incorrect sorting/archiving of messages. | Enforce strict conversion of all timestamps to **UTC** (Z) during the Normalization phase. |
| Large Attachments | Extraction process stalls or fails due to memory limitations. | Implement bulk extraction process with dedicated spool directories and incremental attachment export. |
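
The UTC mitigation for timezone ambiguity could be enforced with a small helper like the one below. It is a sketch, not the plan's prescribed implementation: the function name is hypothetical, the output format follows the `YYYY-MM-DD HH:MM:SS Z` pattern from Section IV with `Z` read as a literal UTC designator, and the choice to reject timestamps that carry no timezone unless one is supplied explicitly is an assumption about what "strict" should mean.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional


def to_utc_string(value: datetime, assume_tz: Optional[tzinfo] = None) -> str:
    """Convert a datetime to UTC and format it as 'YYYY-MM-DD HH:MM:SS Z'.

    Naive datetimes (no tzinfo) are ambiguous, so they are rejected unless
    the caller states the source timezone explicitly via assume_tz.
    """
    if value.tzinfo is None:
        if assume_tz is None:
            raise ValueError("naive timestamp: source timezone must be stated")
        value = value.replace(tzinfo=assume_tz)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S Z")


# Usage: a message delivered at 10:30 in a UTC-5 mailbox.
local = datetime(2021, 3, 4, 10, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(to_utc_string(local))  # 2021-03-04 15:30:00 Z
```

Failing loudly on naive timestamps avoids silently treating local times as UTC, which is the mis-sorting the risk table describes.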