I. Executive Summary
This section provides a high-level overview of the plan to extract, normalize, and structure data from Microsoft Outlook PST files. It summarizes the objectives, methodology, and expected outcomes for migrating historical communications into a centralized, searchable database for compliance, eDiscovery, or archival needs. The focus is on preserving data integrity through the use of specialized tools.
II. Project Scope & Objectives
This section defines the boundaries and goals of the PST extraction project: which data and processes are included (In-Scope) or excluded (Out-of-Scope), and the targets for completeness, integrity, and efficiency.
A. Scope (In-Scope and Out-of-Scope)
| Component | Status | Details |
|---|---|---|
| PST Files | IN | All identified PST files residing on the specified network path/local drives. |
| Extraction Targets | IN | Email messages (including headers, body, attachments), calendar entries, and contact data. |
| Output Format | IN | Export to a normalized, database-ready format (e.g., JSONL or SQL inserts). |
| Data Integrity | IN | Verification of hash values (MD5/SHA-256) before and after extraction. |
| System Decommission | OUT | Deletion or secure destruction of original source PST files post-extraction. |
| Native Application Access | OUT | Providing users with direct access to extracted data via an Outlook client. |
B. Objectives
- **Completeness:** Extract nearly all data objects (emails, contacts, calendar items).
- **Integrity:** Maintain original metadata (timestamps, senders, attachments).
- **Efficiency:** Complete the extraction, processing, and indexing pipeline rapidly.
III. Data Extraction Methodology
This section details the step-by-step approach for extracting data, focusing on preparation, tool selection, and the core eDiscovery standard process.
A. Preparation and Sourcing
- **Inventory:** Create a definitive list of all source PST files (size, date modified).
- **Hashing (Source):** Generate MD5/SHA-256 hashes pre-extraction for Chain of Custody.
- **Environment:** Use a locked-down forensic or virtual environment.
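
The inventory and source-hashing steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual tooling; the function names (`sha256_of_file`, `build_inventory`, `verify_unchanged`) and the inventory field names are assumptions made for this example. Only SHA-256 is shown; an MD5 digest could be added the same way if required.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so a large PST is never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(source_root: Path) -> list[dict]:
    """List every *.pst file under source_root with its size, modification time, and SHA-256."""
    inventory = []
    # Note: the pattern is case-sensitive on most non-Windows file systems.
    for pst_path in sorted(source_root.rglob("*.pst")):
        stat = pst_path.stat()
        inventory.append({
            "path": str(pst_path),
            "sizeBytes": stat.st_size,
            "modifiedEpoch": stat.st_mtime,
            "sha256": sha256_of_file(pst_path),
        })
    return inventory


def verify_unchanged(inventory: list[dict]) -> list[str]:
    """Return the paths whose current hash no longer matches the recorded one."""
    return [
        entry["path"]
        for entry in inventory
        if sha256_of_file(Path(entry["path"])) != entry["sha256"]
    ]
```

Running `verify_unchanged` on the same inventory after extraction gives a simple before/after integrity check: an empty list means no source file changed.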
B. Tool Selection
A licensed, non-consumer-grade application (e.g., forensic eDiscovery software or `pst-extractor`) will be used. It must be capable of handling corruption, bulk processing, and detailed logging.
C. Extraction Process (eDiscovery Standard)
- The tool scans the PST structure, identifying all nested objects (messages, contacts, folders). **Output:** Raw Item Count Log
- Extract all metadata fields (e.g., Subject, Date, From, To, CC, BCC). **Output:** CSV/JSON Metadata Files
- Extract the plain text and HTML body of each item, along with any attachments (placed in a structured folder hierarchy). **Output:** Structured Folders with Attachments
- Convert output into a uniform data structure (JSONL), mapping Outlook-specific fields to standard database fields (e.g., mapping PR_MESSAGE_DELIVERY_TIME to timestamp). **Output:** Normalized JSONL File
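
As a rough illustration of the scan and JSONL steps, the sketch below walks a nested folder tree, produces a per-type item count, and serializes each item as one JSON line. The tree is a plain dictionary standing in for whatever the chosen extraction tool actually emits; its shape (`name`, `items`, `folders`, `type`) and the `folderPath` field are assumptions for this example only.

```python
import json
from collections import Counter
from typing import Iterator


def walk_items(folder: dict, path: str = "") -> Iterator[tuple[str, dict]]:
    """Yield (folder_path, item) for every item in a nested folder tree."""
    current = f"{path}/{folder['name']}"
    for item in folder.get("items", []):
        yield current, item
    for child in folder.get("folders", []):
        yield from walk_items(child, current)


def count_by_type(root: dict) -> Counter:
    """Raw item count log: number of items found per item type."""
    return Counter(item["type"] for _, item in walk_items(root))


def to_jsonl(root: dict) -> str:
    """Serialize every item as one JSON object per line (JSONL)."""
    lines = []
    for folder_path, item in walk_items(root):
        record = {"folderPath": folder_path, **item}
        lines.append(json.dumps(record, ensure_ascii=False, sort_keys=True))
    return "".join(line + "\n" for line in lines)
```

Comparing the counts from `count_by_type` against the number of lines in the JSONL output is one way to confirm that no item was dropped between the scan and normalization steps.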
IV. Data Structure and Normalization
This section outlines the target format for the extracted data: the standard fields, their source mappings within the PST file, their data types, and notes on timestamp formatting and hashing for traceability.
| Output Field | Source Mapping | Data Type | Notes |
|---|---|---|---|
| messageID | PST Item ID / EntryID | String | Unique identifier for archival. |
| senderEmail | PR_SENDER_EMAIL_ADDRESS | String | Normalized to lowercase. |
| timestamp | PR_MESSAGE_DELIVERY_TIME | Datetime | UTC is mandatory (format: YYYY-MM-DD HH:MM:SS Z). |
| subject | PR_SUBJECT | String | |
| bodyText | Extracted plain text body | Text | Used for primary searching. |
| attachmentCount | Count of attachments found | Integer | |
| attachmentPaths | Array of paths to extracted attachment files. | Array of Strings | Relative paths only. |
| originalHash | SHA-256 hash of the original PST file | String | Used for linking back to the source archive. |
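
The field mapping in the table can be sketched as a single normalization function. This is an illustrative example under stated assumptions: the raw item is assumed to be a dictionary keyed by the source names in the table (`EntryID`, `PR_SENDER_EMAIL_ADDRESS`, and so on), with `bodyText` and `attachments` as hypothetical keys for the extracted body and attachment paths. A real extraction tool will expose these values differently.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath, PureWindowsPath


def to_utc_string(value: datetime) -> str:
    """Format a timezone-aware datetime as 'YYYY-MM-DD HH:MM:SS Z' in UTC."""
    if value.tzinfo is None:
        # A naive timestamp cannot be converted safely; the source zone must be known.
        raise ValueError("naive timestamp: source timezone must be known")
    return value.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S") + " Z"


def normalize_item(raw: dict, pst_sha256: str) -> dict:
    """Map one raw item (hypothetical shape) to the output fields in the table."""
    attachments = list(raw.get("attachments", []))
    for path in attachments:
        # The table requires relative paths only.
        if PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute():
            raise ValueError(f"attachment path must be relative: {path}")
    return {
        "messageID": str(raw["EntryID"]),
        "senderEmail": raw["PR_SENDER_EMAIL_ADDRESS"].strip().lower(),
        "timestamp": to_utc_string(raw["PR_MESSAGE_DELIVERY_TIME"]),
        "subject": raw.get("PR_SUBJECT", ""),
        "bodyText": raw.get("bodyText", ""),
        "attachmentCount": len(attachments),
        "attachmentPaths": attachments,
        "originalHash": pst_sha256,
    }
```

Rejecting naive timestamps outright, rather than assuming a default zone, is a deliberate choice in this sketch: a silent assumption would reintroduce the ambiguity that the UTC requirement is meant to remove.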
V. Timeline and Resource Allocation
This section presents the estimated project timeline, assuming 2 TB of data across 100 PST files.
[Chart not reproduced: duration of each key activity and the responsible team.]
VI. Risks and Mitigation
This section identifies potential challenges during the extraction process, their potential impact, and the planned mitigation steps.
| Risk | Impact | Mitigation Strategy |
|---|---|---|
| File Corruption | Loss of entire PST file contents. | Use forensic tool with checkpointing and sector-level recovery capabilities. |
| Timezone Ambiguity | Incorrect sorting/archiving of messages. | Enforce strict conversion of all timestamps to **UTC** (Z) during the Normalization phase. |
| Large Attachments | Extraction process stalls or fails due to memory limitations. | Implement bulk extraction process with dedicated spool directories and incremental attachment export. |
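
The large-attachment mitigation can be illustrated with a chunked copy into a spool directory. This is a minimal sketch, assuming the extraction tool can expose each attachment as a readable binary stream; the function name `spool_attachment`, its parameters, and the `.part` suffix are hypothetical choices for this example.

```python
import shutil
from pathlib import Path
from typing import BinaryIO


def spool_attachment(source: BinaryIO, spool_dir: Path, relative_name: str,
                     chunk_size: int = 1024 * 1024) -> Path:
    """Stream one attachment into spool_dir in fixed-size chunks and return its path."""
    final_path = spool_dir / relative_name
    final_path.parent.mkdir(parents=True, exist_ok=True)
    temp_path = final_path.with_name(final_path.name + ".part")
    with temp_path.open("wb") as target:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so memory use stays bounded regardless of attachment size.
        shutil.copyfileobj(source, target, chunk_size)
    # Rename only after a complete write, so an interrupted export
    # leaves a .part file instead of a truncated attachment.
    temp_path.replace(final_path)
    return final_path
```

Because each attachment is written independently, a failed run can be resumed by skipping files that already exist in the spool directory and discarding any leftover `.part` files.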
