SureDox PII Determination Framework
Purpose
This document provides a repeatable, legally grounded framework for determining what constitutes Personal Information (PI) on each document type SureDox processes. It serves as the authoritative reference when building ground truth files for regression testing.
Authority hierarchy:
- POPIA Section 1 (definitions) and Section 26 (special PI) — primary
- HIPAA §164.514(b) Safe Harbor — consulted for health data where POPIA is silent
- UK ICO Anonymisation Guidance (March 2025) — consulted for the "motivated intruder" / "reasonably foreseeable" test
- GDPR Article 4 — consulted for genetic data definition where POPIA is silent
Why foreign guidance is appropriate: POPIA contains no specific de-identification provisions beyond the definition in Section 1 and the exclusion in Section 6. South African legal scholars explicitly recommend consulting HIPAA §164.514(b) and the UK ICO code of practice where POPIA is silent (Swales, 2021, South African Journal of Science). The Information Regulator has not published de-identification guidance as of February 2026.
Part 1: Universal Rules (All Document Types)
These rules apply regardless of document category.
1.1 The POPIA Test
For any data item on any document, ask:
Is this information "relating to an identifiable, living, natural person" (or, where applicable, an identifiable, existing juristic person)? — POPIA Section 1
If YES → it is PI. The next question is: PI of whom?
1.2 Data Subject Identification
Every detection must identify the data subject — the person whose PI it is:
| Relevance | Meaning | Default action |
|---|---|---|
| primary | PI of the document's subject (patient, signatory, account holder, employee) | Redact by default |
| secondary | PI of another natural person appearing on the document (physician, witness, auditor) | Flag for review |
| institutional | PI of a juristic person (company, lab, bank) | Context-dependent |
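The relevance table can be expressed as a small lookup so that tooling rejects unknown labels early. This is a minimal sketch: the `Relevance` enum, the `default_action` helper, and the action strings (`redact`, `flag_for_review`, `context_dependent`) are illustrative names, not part of any existing SureDox code.

```python
from enum import Enum


class Relevance(Enum):
    """The three relevance labels from the table above."""
    PRIMARY = "primary"
    SECONDARY = "secondary"
    INSTITUTIONAL = "institutional"


# Default actions mirror the table; the action strings are hypothetical labels.
DEFAULT_ACTION = {
    Relevance.PRIMARY: "redact",
    Relevance.SECONDARY: "flag_for_review",
    Relevance.INSTITUTIONAL: "context_dependent",
}


def default_action(relevance: str) -> str:
    """Return the default action for a relevance label.

    Raises ValueError for any label outside the three defined values.
    """
    return DEFAULT_ACTION[Relevance(relevance)]
```

Routing every label through the enum means a typo such as "primray" in a ground truth file fails loudly instead of silently skipping redaction.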
1.3 The De-identification Purpose Test
SureDox redacts documents for de-identification under POPIA Section 14(4). The purpose is to remove information that identifies the data subject. Therefore:
If an item is PI of someone OTHER than the data subject, it only needs redaction if it could be used as a re-identification pathway back to the data subject via a "reasonably foreseeable method" (POPIA Section 1 definition of "de-identify").
1.4 Universal PI Categories (Section 1)
These are always PI when they relate to an identifiable person:
| POPIA Reference | Category | Examples |
|---|---|---|
| Section 1(h) | Name | Full name, initials, maiden name, alias |
| Section 1(c) | Identifying number | ID number, passport number, account number, MRN |
| Section 1(a) | Demographic | Race, gender, sex, age, DOB, nationality, language |
| Section 1(a) | Physical/mental health | Medical conditions, diagnoses, treatment history |
| Section 1(b) | Employment | Job title, employer, salary, employment history |
| Section 1(c) | Contact information | Address, phone, email (when linked to a person) |
| Section 1(d) | Biometric | Fingerprints, handwritten signatures, voiceprints |
| Section 1(b) | Financial | Income, assets, debts, credit history ("financial history") |
| Section 26 | Special PI | Race, ethnic origin, trade union membership, political opinions, health, sex life, criminal behaviour, religious/philosophical beliefs, genetic data |
1.5 Universal Non-PI Categories
These are NEVER PI regardless of document type:
| Category | Examples | Rationale |
|---|---|---|
| Legislative references | "Companies Act 71 of 2008", "POPIA Section 26" | Public law, not information relating to any person |
| Structural elements | Section headings, page numbers, clause numbers | Document formatting |
| Generic role labels | "the Client", "the Employee", "the Patient" | Template references without identifying information |
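| Publicly available regulatory data | CIPC registration (searchable on BizPortal), CLIA numbers | Per SALRC guidance, public registry data. Exception: a registration number that identifies a contracting party is detected under the 3.1 borderline cases |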
| Geographic references without identifier function | Arbitration venues, jurisdiction clauses | Location serving a procedural purpose, not identifying any person |
1.6 The POPIA Basis Requirement
Every item in a ground truth expected_detections array must cite a specific POPIA section. If no section can be cited, the item does not belong in detections.
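One way to enforce this requirement mechanically is a check that runs over each ground truth file before it joins the regression suite. The sketch below assumes a citation is written in the form "Section 1(h)" or "Section 26"; the helper name `missing_popia_basis` and the regular expression are illustrative assumptions, and the field names follow the ground truth JSON structure in Part 4.

```python
import re

# Assumption: citations look like "Section 1(h)", "Section 26", or "Section 1".
POPIA_CITATION = re.compile(r"\bSection\s+\d+(\([a-z]\))?")


def missing_popia_basis(ground_truth: dict) -> list[str]:
    """Return the values of detections whose popia_basis cites no POPIA section."""
    missing = []
    for detection in ground_truth.get("expected_detections", []):
        basis = detection.get("popia_basis") or ""
        if not POPIA_CITATION.search(basis):
            missing.append(detection.get("value", "<no value>"))
    return missing
```

An empty return value means every detection carries a citation. The check only confirms that a section is cited, not that the cited section is the correct one.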
Part 2: Document Type Definitions
Document types SureDox supports:
| Code | Document Type | Data Subject | Primary Use Case |
|---|---|---|---|
| legal_agreement | Contracts, NDAs, service agreements, settlement agreements | Signatories (natural and juristic persons) | Share agreements without exposing client/party identities |
| medical_health | Lab reports, clinical notes, discharge summaries, prescriptions, pathology reports | Patient | De-identify for research, training, secondary use |
| financial | Bank statements, invoices, tax certificates, payslips | Account holder / taxpayer | Share financial records without exposing identity |
| hr_employment | CVs, employment contracts, disciplinary records, performance reviews | Employee / candidate | Share HR documents without exposing identity |
| identity_document | ID books, passports, driver's licences, birth/death certificates | Document holder | Verify then redact for storage/sharing |
| general | Letters, emails, mixed correspondence | Varies | Catch-all for documents that don't fit above categories |
Part 3: Per-Domain PII Guidelines
3.1 Legal Agreements (legal_agreement)
Data subjects: The contracting parties (natural persons as signatories, juristic persons as entities).
What IS PI (detect):
| Item | POPIA Basis | Relevance | Notes |
|---|---|---|---|
| Natural person names (signatories) | Section 1(h) | primary | Names of individuals who signed |
| Juristic person names (parties) | Section 1 "where applicable" | primary | Company names as contracting parties. Unlike GDPR, POPIA protects juristic persons |
| Signing dates | Section 1(c) | primary | Identifiers linked to identifiable persons in context of the signing event |
| Job titles of signatories | Section 1(b) | primary | Employment information of the signatory |
| Handwritten signatures | Section 1(d) | primary | Biometric information |
| Witness names | Section 1(h) | secondary | PI of the witness, not the parties |
What is NOT PI (do not detect):
| Item | Rationale |
|---|---|
| "the Client", "the Customer", "the Employer" | Role labels — template references without identifying content |
| Section headings ("Preamble", "Definitions") | Structural elements |
| Legislative references ("Companies Act 71 of 2008") | Public law |
| AFSA, CCMA, other statutory bodies | Public institutions, not parties to the agreement |
| Arbitration/jurisdiction venue locations | Geographic references serving procedural purpose, not identifying any person |
| Template text describing qualifications/credentials generically | Template reference, not person-specific |
| Third-party software names (e.g. "Sage", "SAP") | Public record / product names |
Borderline cases:
| Item | Decision | Rationale |
|---|---|---|
| Company registration numbers | Detect (primary) | Section 1(c) identifier of the juristic person party |
| VAT numbers of parties | Detect (primary) | Section 1(c) identifier |
| Physical addresses of parties | Detect (primary) | Section 1(c) contact information of parties |
| Professional body memberships mentioned | Do not detect | Template reference unless naming a specific person |
3.2 Medical / Health Documents (medical_health)
Data subject: The patient.
Special considerations: POPIA Section 26 classifies health data and genetic data as special personal information requiring heightened protection. HIPAA §164.514(b) Safe Harbor provides the most detailed guidance on which items must be removed for health data de-identification. South African scholars recommend consulting it where POPIA is silent.
What IS PI (detect):
| Item | POPIA Basis | Relevance | Notes |
|---|---|---|---|
| Patient name | Section 1(h) | primary | |
| Date of birth | Section 1(a) "birth" | primary | |
| Medical record number (MRN) | Section 1(c) | primary | Database key — immediate re-identification |
| Accession/specimen/family IDs | Section 1(c) | primary | Lab system identifiers traceable to patient |
| Sex/gender | Section 1(a) "sex" | primary | |
| Race/ethnicity | Section 26 special PI | primary | Heightened protection required |
| Clinical indication/diagnosis specific to patient | Section 1(a) + Section 26 | primary | "Clinical diagnosis of DCM with arrhythmia" — patient-specific medical history |
| Patient-specific genetic variants | Section 26 genetic data | primary | c.244G>A, p.Glu82Lys — patient's actual mutations |
| Genomic coordinates of patient variants | Section 26 genetic data | primary | Chr11:g.47364249G>A — molecular "GPS" to patient's mutation |
| Report/service dates tied to patient | Section 1(c) | primary | Specimen received, report issued — narrows re-identification window |
| Referring physician name | Section 1(h) | secondary | PI of the physician. Low re-identification risk for patient unless highly specialised. Confidence 0.50 |
| Report author/signatory names | Section 1(h) | secondary | PI of lab staff. Very low re-identification risk — they sign hundreds of reports. Confidence 0.50 |
What is NOT PI (do not detect):
| Item | Rationale | Supporting authority |
|---|---|---|
| Lab/hospital name on letterhead | Institutional identity, not patient PI. Same letterhead appears on every report. | HIPAA: no requirement to remove covered entity name |
| Lab address, phone, fax, URL | Publicly available institutional contact details | HIPAA §164.514(b): 18 identifiers scoped to "the individual", not the institution |
| Referring hospital/institution name | Institutional reference, not patient PI | HIPAA: referring facility not one of 18 identifiers |
| Gene names in coverage/panel tables | Methodology reference — the lab's test menu, not patient results | Scientific standards, not "information relating to" any person |
| Disease descriptions in general information sections | General medical knowledge, not patient-specific | "Dilated Cardiomyopathy is characterised by..." describes the disease, not the patient |
| Methodology terms (GRCh37, "Whole genome sequencing", GATK) | Scientific standards and lab procedures | |
| Specimen types ("Blood, Peripheral") | Methodology reference | |
| Transcript IDs (NM_001122335) | Public scientific database references | |
| OMIM numbers, database references | Public scientific references | |
| CLIA/CAP certification numbers | Publicly available regulatory IDs | SALRC: public registry data |
| Literature citations and author names in reference lists | Published academic work | |
| Illumina, other vendor/platform names | Product names | |
Decision tree for health documents:
Is this about the PATIENT specifically?
├── YES: Is it a name, ID, date, demographic, diagnosis, or genetic finding?
│   ├── YES → DETECT as primary PI (cite POPIA section)
│   └── NO → Evaluate: does it narrow re-identification via "reasonably foreseeable method"?
│       ├── YES → DETECT
│       └── NO → DO NOT DETECT
└── NO: Is it about the INSTITUTION (lab, hospital)?
    ├── YES → DO NOT DETECT (institutional letterhead, not patient PI)
    └── NO: Is it about a PROFESSIONAL (doctor, lab scientist)?
        ├── YES → PI of that person under Section 1(h)
        │   └── Could it create a re-identification pathway to the patient?
        │       ├── HIGH RISK (specialist with few patients) → DETECT at reduced confidence
        │       └── LOW RISK (signs hundreds of reports) → DETECT at 0.50, secondary
        └── NO: Is it METHODOLOGY or GENERAL KNOWLEDGE?
            └── YES → DO NOT DETECT
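The decision tree can be sketched as a single function, which makes each branch testable. Everything here is illustrative: the `about` labels, the boolean flags, and the `Decision` tuple are hypothetical names. Two points are assumptions rather than framework rules: an item on the patient branch that is detected only because it narrows re-identification is labelled primary, and the high-risk professional branch returns no confidence value because the tree says "reduced confidence" without giving a figure.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    detect: bool
    relevance: Optional[str] = None
    confidence: Optional[float] = None


def classify_health_item(
    about: str,
    is_core_identifier: bool = False,
    narrows_reidentification: bool = False,
    high_reidentification_risk: bool = False,
) -> Decision:
    """Walk the health-document decision tree for one item.

    about: "patient", "institution", "professional", or anything else
           (treated as methodology or general knowledge).
    """
    if about == "patient":
        # Name, ID, date, demographic, diagnosis, or genetic finding.
        if is_core_identifier or narrows_reidentification:
            return Decision(True, "primary")  # relevance is an assumption
        return Decision(False)
    if about == "institution":
        return Decision(False)  # institutional letterhead, not patient PI
    if about == "professional":
        if high_reidentification_risk:
            # The tree gives no numeric confidence for this branch.
            return Decision(True, "secondary", None)
        return Decision(True, "secondary", 0.50)
    return Decision(False)  # methodology or general knowledge
```

For example, `classify_health_item("professional")` yields a secondary detection at confidence 0.50, matching the low-risk branch for staff who sign hundreds of reports.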
3.3 Financial Documents (financial)
Data subject: The account holder / taxpayer / payee.
What IS PI (detect):
| Item | POPIA Basis | Relevance | Notes |
|---|---|---|---|
| Account holder name | Section 1(h) | primary | |
| ID number | Section 1(c) | primary | |
| Account number | Section 1(c) | primary | |
| Card number | Section 1(c) | primary | Also a security risk |
| Physical address | Section 1(c) | primary | |
| Phone number, email | Section 1(c) | primary | |
| Date of birth (if shown) | Section 1(a) | primary | |
| Salary/income amounts (on payslips) | Section 1(b) | primary | Financial history |
| Tax number | Section 1(c) | primary | |
| Transaction amounts and dates | Section 1(b) | primary | Financial history tied to the account holder |
| Balance information | Section 1(b) | primary | |
What is NOT PI (do not detect):
| Item | Rationale |
|---|---|
| Bank/institution name and branding | Institutional identity — publicly known. Same header appears on millions of statements |
| Branch code | Semi-public — identifies the branch, not the person |
| Generic banking terms ("Monthly service fee", "Interest earned") | Product/service descriptions |
| Exchange rates, interest rates | Public financial data |
| Regulatory references (FICA, NCA) | Public law |
| Bank contact details (call centre number, website) | Public institutional information |
Borderline cases:
| Item | Decision | Rationale |
|---|---|---|
| Merchant names in transactions | Do not detect | PI of the merchant (juristic person), not the account holder. However, a pattern of merchant visits (e.g., pharmacy, specialist clinic) could reveal health information. Flag for user discretion in specific contexts. |
| Transaction descriptions with location info | Do not detect by default | "WOOLWORTHS SANDTON" — merchant + location, not account holder PI. But patterns could narrow location. |
| Beneficiary names in transfers | Detect (secondary) | PI of another natural person — Section 1(h). Linked to account holder's financial activity. |
| Employer name on payslip | Detect (primary) | Section 1(b) employment information of the data subject |
3.4 HR / Employment Documents (hr_employment)
Data subject: The employee or candidate.
What IS PI (detect):
| Item | POPIA Basis | Relevance | Notes |
|---|---|---|---|
| Employee/candidate name | Section 1(h) | primary | |
| ID number / passport number | Section 1(c) | primary | |
| Date of birth, age | Section 1(a) | primary | |
| Physical address | Section 1(c) | primary | |
| Phone, email | Section 1(c) | primary | |
| Tax number, UIF number | Section 1(c) | primary | |
| Salary, benefits, bonus amounts | Section 1(b) | primary | |
| Bank account details (for salary payments) | Section 1(c) + (b) | primary | |
| Employment dates | Section 1(b) | primary | |
| Job title, department | Section 1(b) | primary | |
| Performance ratings/scores | Section 1(b) | primary | |
| Disciplinary findings | Section 1(b) | primary | Could also be Section 26 (criminal behaviour) |
| Qualifications, schools, universities | Section 1(b) "education" | primary | |
| Referee names and contact details | Section 1(h) + (c) | secondary | PI of the referee |
| Medical fitness certificates | Section 26 health | primary | Special PI |
| Race, gender, disability status | Section 26 + Section 1(a) | primary | Special PI — often collected for employment equity |
| Next of kin / emergency contact details | Section 1(h) + (c) | secondary | PI of another person |
What is NOT PI (do not detect):
| Item | Rationale |
|---|---|
| Company name (employer) on letterhead | Institutional identity — public. Same letterhead on every employee's documents. Similar to lab letterhead rationale. |
| Generic job descriptions/duties | Template text describing the role, not the person |
| Company policies referenced | Institutional documents |
| CCMA case references (case number only, no names) | Procedural identifiers |
| Industry/sector classifications | General information |
| Statutory references (BCEA, LRA, EEA) | Public law |
3.5 Identity Documents (identity_document)
Data subject: The document holder.
Special considerations: Almost everything on an ID document is PI by design. The entire purpose of the document is identification. De-identification of an ID document essentially means redacting nearly all content.
What IS PI (detect):
| Item | POPIA Basis | Relevance | Notes |
|---|---|---|---|
| Full name, surname | Section 1(h) | primary | |
| ID number / passport number | Section 1(c) | primary | South African ID numbers encode DOB, gender, citizenship |
| Date of birth | Section 1(a) | primary | |
| Gender/sex | Section 1(a) | primary | |
| Nationality/citizenship | Section 1(a) | primary | |
| Country of birth | Section 1(a) | primary | |
| Photograph | Section 1(d) biometric | primary | |
| Signature | Section 1(d) biometric | primary | |
| Address (if on document) | Section 1(c) | primary | Driver's licences often include address |
| Barcode / MRZ data | Section 1(c) | primary | Machine-readable zone encodes all identification data |
| Driver's licence number | Section 1(c) | primary | |
| Vehicle restrictions/endorsements | Section 1(a) | primary | Could reveal medical conditions |
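To illustrate why the ID number alone is treated as primary PI, the sketch below decodes the attributes that a 13-digit South African ID number is commonly described as carrying: birth date (YYMMDD), a gender sequence (5000 and above for male), a citizenship digit, and a Luhn check digit. The layout is stated here as an assumption and should be confirmed against Department of Home Affairs material before use. The function names are hypothetical and the number in the usage note is synthetic.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def decode_sa_id(id_number: str) -> dict:
    """Decode the attributes embedded in a 13-digit ID number (assumed layout)."""
    if len(id_number) != 13 or not (id_number.isascii() and id_number.isdigit()):
        raise ValueError("expected exactly 13 digits")
    citizenship = {"0": "SA citizen", "1": "permanent resident"}
    return {
        "birth_yymmdd": id_number[:6],  # century is not encoded
        "gender": "male" if int(id_number[6:10]) >= 5000 else "female",
        "citizenship": citizenship.get(id_number[10], "other"),
        "checksum_valid": luhn_valid(id_number),
    }
```

Calling `decode_sa_id("8001015009087")` on this synthetic number returns a birth date of 800101, gender male, SA citizen, and a valid checksum. Redacting the name and date of birth while leaving the ID number visible would therefore still disclose the date of birth and gender.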
What is NOT PI (do not detect):
| Item | Rationale |
|---|---|
| "Republic of South Africa" | Issuing authority — public |
| Department of Home Affairs reference | Issuing authority |
| Document type label ("Identity Document", "Passport") | Structural |
| Security features, watermarks | Document integrity features |
3.6 General Documents (general)
Data subject: Varies — must be determined from context.
For documents that don't fit the above categories (letters, emails, mixed correspondence), apply the universal rules from Part 1:
- Identify who the data subject is
- Apply the POPIA Section 1 test to each potential PI item
- Check whether the item is information "relating to" the data subject
- Cite the specific POPIA section
- Classify as primary or secondary relevance
Part 4: Ground Truth File Construction Process
Step-by-step for each new document:
- Classify the document type using the categories in Part 2
- Identify the data subject(s) — who is this document about?
- Read the domain guidelines in Part 3 for that document type
- For each item on the document, apply the decision:
  - Is it information relating to an identifiable person? → Which POPIA section?
  - Is it PI of the data subject (primary) or someone else (secondary)?
  - Is it institutional/public information? → Non-detection
  - Is it methodology/template/structural? → Non-detection
- Build the JSON with every detection citing popia_basis and data_subject_relevance
- Build the non-detections list: critical for regression testing. Over-redaction is as bad as under-detection
- Build context classifications — items that require domain-specific context to classify correctly
- Run against the regression suite — new ground truth must not conflict with existing ones
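The final step does not define "conflict", so the sketch below adopts one plausible reading as an assumption: within the same document_category, a value that one ground truth file expects to be detected must not appear in another file's expected_non_detections. The helper name `find_conflicts` is hypothetical, and the field names are those of the ground truth JSON structure.

```python
def find_conflicts(new: dict, existing: list[dict]) -> list[tuple[str, str]]:
    """Return (value, document) pairs where the new file contradicts an existing one.

    Assumed definition of a conflict: same document_category, and a value is an
    expected detection in one file but an expected non-detection in the other.
    """
    new_detect = {d["value"] for d in new.get("expected_detections", [])}
    new_skip = {d["value"] for d in new.get("expected_non_detections", [])}
    conflicts = []
    for other in existing:
        if other.get("document_category") != new.get("document_category"):
            continue
        other_detect = {d["value"] for d in other.get("expected_detections", [])}
        other_skip = {d["value"] for d in other.get("expected_non_detections", [])}
        clashing = (new_detect & other_skip) | (new_skip & other_detect)
        document = other.get("document", "<unknown>")
        conflicts.extend((value, document) for value in sorted(clashing))
    return conflicts
```

Exact string matching is deliberately strict. A real check would likely need to normalise case and whitespace, and might treat some cross-category clashes as conflicts too.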
Ground truth JSON structure:
{
  "document": "filename.pdf",
  "document_category": "medical_health",
  "compliance_standard": "strict_popia",
  "total_pages": 6,
  "data_subject": "Patient (natural person)",
  "expected_detections": [
    {
      "value": "DOE, JOHN",
      "category": "NAMES",
      "context": "person_specific",
      "popia_basis": "Section 1(h) — 'the name of the person'",
      "data_subject_relevance": "primary",
      "confidence": 0.95,
      "pages": [1, 2, 3, 4, 5, 6],
      "notes": "Patient name — appears on every page header"
    }
  ],
  "expected_non_detections": [
    {
      "value": "LABORATORY FOR MOLECULAR MEDICINE",
      "reason": "institutional_letterhead — Testing laboratory name on standardised report header. Not PI of the patient. HIPAA Safe Harbor does not require removal of covered entity name."
    }
  ],
  "expected_context_classifications": [
    {
      "value": "c.244G>A",
      "expected_context": "patient_finding",
      "notes": "Patient's pathogenic variant — Section 26 genetic data"
    }
  ],
  "key_test_areas": [
    "Description of what the regression test should specifically validate"
  ]
}
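A regression check against a file of this shape can be sketched as a comparison of two sets: values the engine detected, and values the ground truth expects or forbids. The function name `score_against_ground_truth` and the idea that the engine's output is available as a set of strings are assumptions for illustration, not a description of the actual SureDox test harness.

```python
def score_against_ground_truth(ground_truth: dict, detected_values: set[str]) -> dict:
    """Compare detected values with a ground truth file.

    missed:        expected detections the engine did not produce (under-detection)
    over_redacted: expected non-detections the engine flagged anyway
    """
    expected = {d["value"] for d in ground_truth.get("expected_detections", [])}
    forbidden = {d["value"] for d in ground_truth.get("expected_non_detections", [])}
    missed = sorted(expected - detected_values)
    over_redacted = sorted(forbidden & detected_values)
    return {
        "missed": missed,
        "over_redacted": over_redacted,
        "passed": not missed and not over_redacted,
    }
```

With the example file above, an engine that returns both "DOE, JOHN" and "LABORATORY FOR MOLECULAR MEDICINE" fails on over-redaction even though nothing was missed, which is why the non-detections list matters as much as the detections list.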
Part 5: Foreign Guidance Quick Reference
HIPAA §164.514(b) Safe Harbor — 18 Identifiers
Scope: Identifiers of "the individual or of relatives, employers, or household members of the individual." NOT the healthcare provider/institution.
Key exclusions relevant to SureDox:
- Physician/workforce/vendor names do NOT need removal (HHS FAQ 3.8)
- Covered entity's own name, address, phone do NOT need removal
- The 18 identifiers are: names, geographic data smaller than state, dates (except year), phone, fax, email, SSN, MRN, health plan number, account numbers, certificate/licence numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers, full-face photos, and any other unique identifying number/characteristic/code
UK ICO Anonymisation Guidance (March 2025)
Key concept: "Motivated intruder" test — would a reasonably motivated person, without special knowledge, be able to re-identify a data subject from the remaining information?
Relevance to SureDox: Supports context-dependent assessment. The same item (e.g., a physician name) might or might not enable re-identification depending on how specialised the physician is.
POPIA's "Reasonably Foreseeable Method" Standard
POPIA Section 1 defines de-identification as deleting information that "can be used or manipulated by a reasonably foreseeable method to identify the data subject."
Interpretation (per Thaldar, PMC 2023): The identifiability of data subjects must be determined based on specific context. This aligns with the UK ICO's "reasonably likely" test and supports the view that institutional details on medical reports are not patient identifiers.
Part 6: Document Type Priority Roadmap
Current state (February 2026):
| # | Document Type | Ground Truth | Status |
|---|---|---|---|
| 1 | Legal agreement (Kettle Contract) | v2 | ✅ Complete — 11 detections, strict POPIA |
| 2 | Medical health (Genome Report) | v2 | ✅ Complete — 20 detections, strict POPIA. Professional names pending final decision. |
| 3 | Financial (bank statement) | — | ❌ Need test document |
| 4 | HR/employment (CV or contract) | — | ❌ Need test document |
| 5 | Identity document (ID/passport) | — | ❌ Need test document |
| 6 | General (letter/email) | — | ❌ Need test document |
Priority order for next ground truths:
- Financial — highest demand from compliance officers, clear PI patterns, tests merchant/institutional handling
- HR/employment — common use case, tests employment equity data (Section 26 special PI)
- Identity document — straightforward (almost everything is PI) but tests image-based/DocuPipe path
- General — catch-all, lowest priority, most variable
For each new document:
- Source a test document (real preferred, synthetic acceptable)
- Apply Part 3 guidelines to classify every item
- Build ground truth JSON following Part 4 structure
- Run regression suite — must pass ALL existing ground truths
- Add domain-specific classification prompt (implementation plan Steps 8-10)
Changelog
| Date | Change | Rationale |
|---|---|---|
| 28 Feb 2026 | Initial framework created | Standardise ground truth creation across document types |
| 28 Feb 2026 | Institutional letterhead excluded from medical detections | Not patient PI per HIPAA Safe Harbor, POPIA de-identification purpose test |
| 28 Feb 2026 | Professional names set at 0.50 confidence, secondary relevance | POPIA Section 1(h) applies but HIPAA explicitly excludes workforce names from de-identification requirement |