
SureDox PII Determination Framework

25 February 2026 · SureDox · 19 min read · Compliance Framework


Purpose

This document provides a repeatable, legally grounded framework for determining what constitutes Personal Information (PI) on each document type SureDox processes. It serves as the authoritative reference when building ground truth files for regression testing.

Authority hierarchy:

  1. POPIA Section 1 (definitions) and Section 26 (special PI) — primary
  2. HIPAA §164.514(b) Safe Harbor — consulted for health data where POPIA is silent
  3. UK ICO Anonymisation Guidance (March 2025) — consulted for the "motivated intruder" / "reasonably foreseeable" test
  4. GDPR Article 4 — consulted for genetic data definition where POPIA is silent

Why foreign guidance is appropriate: POPIA contains no specific de-identification provisions beyond the definition in Section 1 and the exclusion in Section 6. South African legal scholars explicitly recommend consulting HIPAA §164.514(b) and the UK ICO code of practice where POPIA is silent (Swales, 2021, South African Journal of Science). The Information Regulator has not published de-identification guidance as of February 2026.


Part 1: Universal Rules (All Document Types)

These rules apply regardless of document category.

1.1 The POPIA Test

For any data item on any document, ask:

Is this information "relating to an identifiable, living, natural person" (or, where applicable, an identifiable, existing juristic person)? — POPIA Section 1

If YES → it is PI. The next question is: PI of whom?

1.2 Data Subject Identification

Every detection must identify the data subject — the person whose PI it is:

| Relevance | Meaning | Default action |
| --- | --- | --- |
| primary | PI of the document's subject (patient, signatory, account holder, employee) | Redact by default |
| secondary | PI of another natural person appearing on the document (physician, witness, auditor) | Flag for review |
| institutional | PI of a juristic person (company, lab, bank) | Context-dependent |

1.3 The De-identification Purpose Test

SureDox redacts documents for de-identification under POPIA Section 14(4). The purpose is to remove information that identifies the data subject. Therefore:

If an item is PI of someone OTHER than the data subject, it only needs redaction if it could be used as a re-identification pathway back to the data subject via a "reasonably foreseeable method" (POPIA Section 1 definition of "de-identify").
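
The purpose test can be expressed as a small predicate. This is a minimal sketch, not SureDox's implementation: the `Relevance` enum and the `reidentification_pathway` flag are hypothetical names, and the flag stands in for a reviewer's judgement about the "reasonably foreseeable method" standard.

```python
from enum import Enum


class Relevance(Enum):
    PRIMARY = "primary"              # PI of the document's data subject
    SECONDARY = "secondary"          # PI of another natural person
    INSTITUTIONAL = "institutional"  # PI of a juristic person


def needs_redaction(relevance: Relevance, reidentification_pathway: bool) -> bool:
    """Apply the de-identification purpose test from section 1.3.

    reidentification_pathway is an assumed input: a reviewer's answer to
    "could this item lead back to the data subject by a reasonably
    foreseeable method?"
    """
    if relevance is Relevance.PRIMARY:
        return True  # primary PI is redacted by default (section 1.2)
    # PI of anyone else only needs redaction if it leads back to the data subject.
    return reidentification_pathway
```

For example, `needs_redaction(Relevance.SECONDARY, False)` is `False`: a witness name with no pathway back to the data subject is left for review rather than redacted outright.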

1.4 Universal PI Categories (Section 1)

These are always PI when they relate to an identifiable person:

| POPIA Reference | Category | Examples |
| --- | --- | --- |
| Section 1(h) | Name | Full name, initials, maiden name, alias |
| Section 1(c) | Identifying number | ID number, passport number, account number, MRN |
| Section 1(a) | Demographic | Race, gender, sex, age, DOB, nationality, language |
| Section 1(a) | Physical/mental health | Medical conditions, diagnoses, treatment history |
| Section 1(b) | Employment | Job title, employer, salary, employment history |
| Section 1(c) | Contact information | Address, phone, email (when linked to a person) |
| Section 1(d) | Biometric | Fingerprints, handwritten signatures, voiceprints |
| Section 1(b) | Financial | Income, assets, debts, credit history |
| Section 26 | Special PI | Race, ethnic origin, trade union membership, political opinions, health, sex life, criminal behaviour, religious/philosophical beliefs, genetic data |

1.5 Universal Non-PI Categories

These are NEVER PI regardless of document type:

| Category | Examples | Rationale |
| --- | --- | --- |
| Legislative references | "Companies Act 71 of 2008", "POPIA Section 26" | Public law, not information relating to any person |
| Structural elements | Section headings, page numbers, clause numbers | Document formatting |
| Generic role labels | "the Client", "the Employee", "the Patient" | Template references without identifying information |
| Publicly available regulatory data | CIPC registration (searchable on BizPortal), CLIA numbers | Per SALRC guidance, public registry data |
| Geographic references without identifier function | Arbitration venues, jurisdiction clauses | Location serving a procedural purpose, not identifying any person |

1.6 The POPIA Basis Requirement

Every item in a ground truth expected_detections array must cite a specific POPIA section. If no section can be cited, the item does not belong in detections.
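
A check like the following could enforce this rule before a ground truth file is accepted. It is a sketch under assumptions: the function name is hypothetical, and the regular expression assumes citations start with "Section" followed by a number, as they do in the tables of this document.

```python
import re

# Assumed citation shape: "Section 1(h) ...", "Section 26 ...", and similar.
POPIA_CITATION = re.compile(r"^Section \d+(\([a-z]\))?")


def detections_missing_basis(ground_truth: dict) -> list:
    """Return the values of detections that lack a usable popia_basis."""
    missing = []
    for item in ground_truth.get("expected_detections", []):
        basis = item.get("popia_basis", "")
        if not POPIA_CITATION.match(basis):
            missing.append(item.get("value", "<no value>"))
    return missing
```

An empty result means every detection cites a section; anything returned either needs a citation or should move out of `expected_detections`.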


Part 2: Document Type Definitions

Document types SureDox supports:

| Code | Document Type | Data Subject | Primary Use Case |
| --- | --- | --- | --- |
| legal_agreement | Contracts, NDAs, service agreements, settlement agreements | Signatories (natural and juristic persons) | Share agreements without exposing client/party identities |
| medical_health | Lab reports, clinical notes, discharge summaries, prescriptions, pathology reports | Patient | De-identify for research, training, secondary use |
| financial | Bank statements, invoices, tax certificates, payslips | Account holder / taxpayer | Share financial records without exposing identity |
| hr_employment | CVs, employment contracts, disciplinary records, performance reviews | Employee / candidate | Share HR documents without exposing identity |
| identity_document | ID books, passports, driver's licences, birth/death certificates | Document holder | Verify then redact for storage/sharing |
| general | Letters, emails, mixed correspondence | Varies | Catch-all for documents that don't fit above categories |

Part 3: Per-Domain PII Guidelines

3.1 Legal Agreements (legal_agreement)

Data subjects: The contracting parties (natural persons as signatories, juristic persons as entities).

What IS PI (detect):

| Item | POPIA Basis | Relevance | Notes |
| --- | --- | --- | --- |
| Natural person names (signatories) | Section 1(h) | primary | Names of individuals who signed |
| Juristic person names (parties) | Section 1 "where applicable" | primary | Company names as contracting parties. POPIA, unlike GDPR, protects juristic persons |
| Signing dates | Section 1(c) | primary | Identifiers linked to identifiable persons in context of the signing event |
| Job titles of signatories | Section 1(b) | primary | Employment information of the signatory |
| Handwritten signatures | Section 1(d) | primary | Biometric information |
| Witness names | Section 1(h) | secondary | PI of the witness, not the parties |

What is NOT PI (do not detect):

| Item | Rationale |
| --- | --- |
| "the Client", "the Customer", "the Employer" | Role labels: template references without identifying content |
| Section headings ("Preamble", "Definitions") | Structural elements |
| Legislative references ("Companies Act 71 of 2008") | Public law |
| AFSA, CCMA, other statutory bodies | Public institutions, not parties to the agreement |
| Arbitration/jurisdiction venue locations | Geographic references serving procedural purpose, not identifying any person |
| Template text describing qualifications/credentials generically | Template reference, not person-specific |
| Third-party software names (e.g. "Sage", "SAP") | Public record / product names |

Borderline cases:

| Item | Decision | Rationale |
| --- | --- | --- |
| Company registration numbers | Detect (primary) | Section 1(c) identifier of the juristic person party |
| VAT numbers of parties | Detect (primary) | Section 1(c) identifier |
| Physical addresses of parties | Detect (primary) | Section 1(c) contact information of parties |
| Professional body memberships mentioned | Do not detect | Template reference unless naming a specific person |

3.2 Medical / Health Documents (medical_health)

Data subject: The patient.

Special considerations: POPIA Section 26 classifies health information as special personal information requiring heightened protection. This framework treats genetic data as special PI on the same basis, taking its definition from GDPR Article 4 because POPIA does not define the term. HIPAA §164.514(b) Safe Harbor provides the most detailed guidance on which items must be removed for health data de-identification, and South African scholars recommend consulting it where POPIA is silent.

What IS PI (detect):

| Item | POPIA Basis | Relevance | Notes |
| --- | --- | --- | --- |
| Patient name | Section 1(h) | primary | |
| Date of birth | Section 1(a) "birth" | primary | |
| Medical record number (MRN) | Section 1(c) | primary | Database key: immediate re-identification |
| Accession/specimen/family IDs | Section 1(c) | primary | Lab system identifiers traceable to patient |
| Sex/gender | Section 1(a) "sex" | primary | |
| Race/ethnicity | Section 26 special PI | primary | Heightened protection required |
| Clinical indication/diagnosis specific to patient | Section 1(a) + Section 26 | primary | "Clinical diagnosis of DCM with arrhythmia" is patient-specific medical history |
| Patient-specific genetic variants | Section 26 genetic data | primary | c.244G>A, p.Glu82Lys: the patient's actual mutations |
| Genomic coordinates of patient variants | Section 26 genetic data | primary | Chr11:g.47364249G>A: molecular "GPS" to patient's mutation |
| Report/service dates tied to patient | Section 1(c) | primary | Specimen received, report issued; narrows re-identification window |
| Referring physician name | Section 1(h) | secondary | PI of the physician. Low re-identification risk for patient unless highly specialised. Confidence 0.50 |
| Report author/signatory names | Section 1(h) | secondary | PI of lab staff. Very low re-identification risk because they sign hundreds of reports. Confidence 0.50 |

What is NOT PI (do not detect):

| Item | Rationale | Supporting authority |
| --- | --- | --- |
| Lab/hospital name on letterhead | Institutional identity, not patient PI. Same letterhead appears on every report. | HIPAA: no requirement to remove covered entity name |
| Lab address, phone, fax, URL | Publicly available institutional contact details | HIPAA §164.514(b): 18 identifiers scoped to "the individual", not the institution |
| Referring hospital/institution name | Institutional reference, not patient PI | HIPAA: referring facility not one of 18 identifiers |
| Gene names in coverage/panel tables | Methodology reference: the lab's test menu, not patient results | Scientific standards, not "information relating to" any person |
| Disease descriptions in general information sections | General medical knowledge, not patient-specific | "Dilated Cardiomyopathy is characterised by..." describes the disease, not the patient |
| Methodology terms (GRCh37, "Whole genome sequencing", GATK) | Scientific standards and lab procedures | |
| Specimen types ("Blood, Peripheral") | Methodology reference | |
| Transcript IDs (NM_001122335) | Public scientific database references | |
| OMIM numbers, database references | Public scientific references | |
| CLIA/CAP certification numbers | Publicly available regulatory IDs | SALRC: public registry data |
| Literature citations and author names in reference lists | Published academic work | |
| Illumina, other vendor/platform names | Product names | |

Decision tree for health documents:

Is this about the PATIENT specifically?
├── YES: Is it a name, ID, date, demographic, diagnosis, or genetic finding?
│   ├── YES → DETECT as primary PI (cite POPIA section)
│   └── NO → Evaluate: does it narrow re-identification via "reasonably foreseeable method"?
│       ├── YES → DETECT
│       └── NO → DO NOT DETECT
└── NO: Is it about the INSTITUTION (lab, hospital)?
    ├── YES → DO NOT DETECT (institutional letterhead, not patient PI)
    └── NO: Is it about a PROFESSIONAL (doctor, lab scientist)?
        ├── YES → PI of that person under Section 1(h)
        │   └── Could it create a re-identification pathway to the patient?
        │       ├── HIGH RISK (specialist with few patients) → DETECT at reduced confidence
        │       └── LOW RISK (signs hundreds of reports) → DETECT at 0.50, secondary
        └── NO: Is it METHODOLOGY or GENERAL KNOWLEDGE?
            └── YES → DO NOT DETECT
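
The tree above can be sketched as a function. This is illustrative only: `HealthItem` and its fields are hypothetical names, the boolean flags stand in for a reviewer's judgement, and the confidence values from the tree are deliberately left out because the high-risk branch does not state a number.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class HealthItem:
    about: str  # "patient", "institution", "professional", or "methodology"
    is_core_identifier: bool = False        # name, ID, date, demographic, diagnosis, genetic finding
    narrows_reidentification: bool = False  # reviewer judgement on "reasonably foreseeable method"


def classify_health_item(item: HealthItem) -> Tuple[str, Optional[str]]:
    """Return (decision, relevance) following the branches of the tree."""
    if item.about == "patient":
        if item.is_core_identifier or item.narrows_reidentification:
            return ("detect", "primary")
        return ("do_not_detect", None)
    if item.about == "institution":
        return ("do_not_detect", None)  # institutional letterhead, not patient PI
    if item.about == "professional":
        # Both risk branches end in a detection; only the confidence differs.
        return ("detect", "secondary")
    if item.about == "methodology":
        return ("do_not_detect", None)
    raise ValueError(f"the tree does not cover items about {item.about!r}")
```

Raising on an unknown `about` value is a design choice for the sketch: the tree has no branch for items outside these four groups, so such items should go to a human reviewer instead of being silently classified.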

3.3 Financial Documents (financial)

Data subject: The account holder / taxpayer / payee.

What IS PI (detect):

| Item | POPIA Basis | Relevance | Notes |
| --- | --- | --- | --- |
| Account holder name | Section 1(h) | primary | |
| ID number | Section 1(c) | primary | |
| Account number | Section 1(c) | primary | |
| Card number | Section 1(c) | primary | Also a security risk |
| Physical address | Section 1(c) | primary | |
| Phone number, email | Section 1(c) | primary | |
| Date of birth (if shown) | Section 1(a) | primary | |
| Salary/income amounts (on payslips) | Section 1(b) | primary | Financial information |
| Tax number | Section 1(c) | primary | |
| Transaction amounts and dates | Section 1(b) | primary | Financial information tied to the account holder |
| Balance information | Section 1(b) | primary | |

What is NOT PI (do not detect):

| Item | Rationale |
| --- | --- |
| Bank/institution name and branding | Institutional identity, publicly known. Same header appears on millions of statements |
| Branch code | Semi-public: identifies the branch, not the person |
| Generic banking terms ("Monthly service fee", "Interest earned") | Product/service descriptions |
| Exchange rates, interest rates | Public financial data |
| Regulatory references (FICA, NCA) | Public law |
| Bank contact details (call centre number, website) | Public institutional information |

Borderline cases:

| Item | Decision | Rationale |
| --- | --- | --- |
| Merchant names in transactions | Do not detect | PI of the merchant (juristic person), not the account holder. However, a pattern of merchant visits (e.g., pharmacy, specialist clinic) could reveal health information. Flag for user discretion in specific contexts. |
| Transaction descriptions with location info | Do not detect by default | "WOOLWORTHS SANDTON" is merchant + location, not account holder PI. But patterns could narrow location. |
| Beneficiary names in transfers | Detect (secondary) | PI of another natural person under Section 1(h). Linked to account holder's financial activity. |
| Employer name on payslip | Detect (primary) | Section 1(b) employment information of the data subject |
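
The "flag for user discretion" case for merchant names could be approximated with a keyword pass over transaction descriptions. This is a rough sketch: the keyword list and function name are hypothetical, and a real deployment would need a curated list and a human reviewer behind it.

```python
# Hypothetical keyword list, chosen only to illustrate the idea.
HEALTH_HINTS = ("pharmacy", "clinic", "hospital", "pathology")


def flag_health_merchants(descriptions: list) -> list:
    """Return transaction descriptions whose merchant text hints at health services.

    Flagged items are not detections; they are surfaced so a user can decide
    whether the pattern reveals health information about the account holder.
    """
    return [d for d in descriptions if any(hint in d.lower() for hint in HEALTH_HINTS)]
```

Given the illustrative input `["WOOLWORTHS SANDTON", "CITY PHARMACY ROSEBANK"]`, only the second description is returned for review.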

3.4 HR / Employment Documents (hr_employment)

Data subject: The employee or candidate.

What IS PI (detect):

| Item | POPIA Basis | Relevance | Notes |
| --- | --- | --- | --- |
| Employee/candidate name | Section 1(h) | primary | |
| ID number / passport number | Section 1(c) | primary | |
| Date of birth, age | Section 1(a) | primary | |
| Physical address | Section 1(c) | primary | |
| Phone, email | Section 1(c) | primary | |
| Tax number, UIF number | Section 1(c) | primary | |
| Salary, benefits, bonus amounts | Section 1(b) | primary | |
| Bank account details (for salary payments) | Section 1(c) + (b) | primary | |
| Employment dates | Section 1(b) | primary | |
| Job title, department | Section 1(b) | primary | |
| Performance ratings/scores | Section 1(b) | primary | |
| Disciplinary findings | Section 1(b) | primary | Could also be Section 26 (criminal behaviour) |
| Qualifications, schools, universities | Section 1(b) "education" | primary | |
| Referee names and contact details | Section 1(h) + (c) | secondary | PI of the referee |
| Medical fitness certificates | Section 26 health | primary | Special PI |
| Race, gender, disability status | Section 26 + Section 1(a) | primary | Special PI, often collected for employment equity |
| Next of kin / emergency contact details | Section 1(h) + (c) | secondary | PI of another person |

What is NOT PI (do not detect):

| Item | Rationale |
| --- | --- |
| Company name (employer) on letterhead | Institutional identity, public. Same letterhead on every employee's documents. Similar to lab letterhead rationale. |
| Generic job descriptions/duties | Template text describing the role, not the person |
| Company policies referenced | Institutional documents |
| CCMA case references (case number only, no names) | Procedural identifiers |
| Industry/sector classifications | General information |
| Statutory references (BCEA, LRA, EEA) | Public law |

3.5 Identity Documents (identity_document)

Data subject: The document holder.

Special considerations: Almost everything on an ID document is PI by design. The entire purpose of the document is identification. De-identification of an ID document essentially means redacting nearly all content.

What IS PI (detect):

| Item | POPIA Basis | Relevance | Notes |
| --- | --- | --- | --- |
| Full name, surname | Section 1(h) | primary | |
| ID number / passport number | Section 1(c) | primary | South African ID numbers encode DOB, gender, citizenship |
| Date of birth | Section 1(a) | primary | |
| Gender/sex | Section 1(a) | primary | |
| Nationality/citizenship | Section 1(a) | primary | |
| Country of birth | Section 1(a) | primary | |
| Photograph | Section 1(d) biometric | primary | |
| Signature | Section 1(d) biometric | primary | |
| Address (if on document) | Section 1(c) | primary | Driver's licences often include address |
| Barcode / MRZ data | Section 1(c) | primary | Machine-readable zone encodes all identification data |
| Driver's licence number | Section 1(c) | primary | |
| Vehicle restrictions/endorsements | Section 1(a) | primary | Could reveal medical conditions |
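
To illustrate why an ID number alone is rich PI, the sketch below decodes the fields a South African ID number carries. It assumes the commonly documented 13-digit layout (YYMMDD, a four-digit sequence where 5000 and above indicates male, a citizenship digit where 0 indicates a citizen, one further digit, and a Luhn check digit). The function names are hypothetical and the example number is synthetic.

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def decode_sa_id(id_number: str) -> dict:
    """Extract the fields encoded in a 13-digit South African ID number."""
    if len(id_number) != 13 or not id_number.isdigit():
        raise ValueError("expected 13 digits")
    return {
        "birth_yymmdd": id_number[0:6],  # century is not encoded
        "sex": "male" if int(id_number[6:10]) >= 5000 else "female",
        "citizen": id_number[10] == "0",
        "checksum_ok": luhn_valid(id_number),
    }
```

Decoding the synthetic number `8001015009087` yields a birth date of `800101`, sex `male`, citizen `True`, and a passing checksum, which is why redacting the date of birth but leaving the ID number visible does not de-identify the holder.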

What is NOT PI (do not detect):

| Item | Rationale |
| --- | --- |
| "Republic of South Africa" | Issuing authority, public |
| Department of Home Affairs reference | Issuing authority |
| Document type label ("Identity Document", "Passport") | Structural |
| Security features, watermarks | Document integrity features |

3.6 General Documents (general)

Data subject: Varies — must be determined from context.

For documents that don't fit the above categories (letters, emails, mixed correspondence), apply the universal rules from Part 1:

  1. Identify who the data subject is
  2. Apply the POPIA Section 1 test to each potential PI item
  3. Check whether the item is information "relating to" the data subject
  4. Cite the specific POPIA section
  5. Classify as primary or secondary relevance

Part 4: Ground Truth File Construction Process

Step-by-step for each new document:

  1. Classify the document type using the categories in Part 2
  2. Identify the data subject(s) — who is this document about?
  3. Read the domain guidelines in Part 3 for that document type
  4. For each item on the document, apply the decision:
    • Is it information relating to an identifiable person? → Which POPIA section?
    • Is it PI of the data subject (primary) or someone else (secondary)?
    • Is it institutional/public information? → Non-detection
    • Is it methodology/template/structural? → Non-detection
  5. Build the JSON with every detection citing popia_basis and data_subject_relevance
  6. Build the non-detections list — critical for regression testing. Over-redaction is as bad as under-detection
  7. Build context classifications — items that require domain-specific context to classify correctly
  8. Run against the regression suite — new ground truth must not conflict with existing ones
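
Step 8 can be partly automated. The sketch below is one possible conflict check, not the actual regression suite: it assumes ground truth files are already loaded as dictionaries in the JSON structure used by this framework, and it treats a value as conflicting when one file of the same `document_category` expects a detection and another expects a non-detection. The function name is hypothetical.

```python
def find_conflicts(new_gt: dict, existing_gts: list) -> list:
    """Return (value, existing document) pairs where the two files disagree."""
    new_detect = {d["value"] for d in new_gt.get("expected_detections", [])}
    new_skip = {n["value"] for n in new_gt.get("expected_non_detections", [])}
    conflicts = []
    for gt in existing_gts:
        # Only compare like with like: the same value can be PI in one
        # document category and institutional in another.
        if gt.get("document_category") != new_gt.get("document_category"):
            continue
        old_detect = {d["value"] for d in gt.get("expected_detections", [])}
        old_skip = {n["value"] for n in gt.get("expected_non_detections", [])}
        for value in sorted((new_detect & old_skip) | (new_skip & old_detect)):
            conflicts.append((value, gt.get("document", "<unknown>")))
    return conflicts
```

An exact string match is a simplification; a conflict reported here is a prompt to re-read both documents, since context can legitimately change the classification.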

Ground truth JSON structure:

{
  "document": "filename.pdf",
  "document_category": "medical_health",
  "compliance_standard": "strict_popia",
  "total_pages": 6,
  "data_subject": "Patient (natural person)",
  "expected_detections": [
    {
      "value": "DOE, JOHN",
      "category": "NAMES",
      "context": "person_specific",
      "popia_basis": "Section 1(h) — 'the name of the person'",
      "data_subject_relevance": "primary",
      "confidence": 0.95,
      "pages": [1, 2, 3, 4, 5, 6],
      "notes": "Patient name — appears on every page header"
    }
  ],
  "expected_non_detections": [
    {
      "value": "LABORATORY FOR MOLECULAR MEDICINE",
      "reason": "institutional_letterhead — Testing laboratory name on standardised report header. Not PI of the patient. HIPAA Safe Harbor does not require removal of covered entity name."
    }
  ],
  "expected_context_classifications": [
    {
      "value": "c.244G>A",
      "expected_context": "patient_finding",
      "notes": "Patient's pathogenic variant — Section 26 genetic data"
    }
  ],
  "key_test_areas": [
    "Description of what the regression test should specifically validate"
  ]
}
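
A structural check on files of this shape might look like the following. It is a sketch under assumptions: the set of required keys and the function name are choices made for illustration, while the allowed relevance values come from the table in section 1.2.

```python
import json

# Assumed minimum set of top-level keys for a usable ground truth file.
REQUIRED_TOP_LEVEL = {
    "document",
    "document_category",
    "data_subject",
    "expected_detections",
    "expected_non_detections",
}
ALLOWED_RELEVANCE = {"primary", "secondary", "institutional"}


def validate_ground_truth(raw: str) -> list:
    """Parse a ground truth JSON string and return a list of problems found."""
    gt = json.loads(raw)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_TOP_LEVEL - gt.keys())]
    for i, det in enumerate(gt.get("expected_detections", [])):
        if det.get("data_subject_relevance") not in ALLOWED_RELEVANCE:
            problems.append(f"detection {i}: unknown data_subject_relevance")
        conf = det.get("confidence")
        if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
            problems.append(f"detection {i}: confidence must be between 0 and 1")
    return problems
```

An empty list means the file has the expected shape; it says nothing about whether the classifications themselves are right.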

Part 5: Foreign Guidance Quick Reference

HIPAA §164.514(b) Safe Harbor — 18 Identifiers

Scope: Identifiers of "the individual or of relatives, employers, or household members of the individual." NOT the healthcare provider/institution.

Key exclusions relevant to SureDox:

  • Physician/workforce/vendor names do NOT need removal (HHS FAQ 3.8)
  • Covered entity's own name, address, phone do NOT need removal

The 18 identifiers are: names, geographic subdivisions smaller than a state, dates (except year), phone numbers, fax numbers, email addresses, SSN, MRN, health plan numbers, account numbers, certificate/licence numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers, full-face photos, and any other unique identifying number, characteristic, or code.

UK ICO Anonymisation Guidance (March 2025)

Key concept: "Motivated intruder" test — would a reasonably motivated person, without special knowledge, be able to re-identify a data subject from the remaining information?

Relevance to SureDox: Supports context-dependent assessment. The same item (e.g., a physician name) might or might not enable re-identification depending on how specialised the physician is.

POPIA's "Reasonably Foreseeable Method" Standard

POPIA Section 1 defines de-identification as deleting information that "can be used or manipulated by a reasonably foreseeable method to identify the data subject."

Interpretation (per Thaldar, PMC 2023): The identifiability of data subjects must be determined based on specific context. This aligns with the UK ICO's "reasonably likely" test and supports the view that institutional details on medical reports are not patient identifiers.


Part 6: Document Type Priority Roadmap

Current state (February 2026):

| # | Document Type | Ground Truth | Status |
| --- | --- | --- | --- |
| 1 | Legal agreement (Kettle Contract) | v2 | ✅ Complete: 11 detections, strict POPIA |
| 2 | Medical health (Genome Report) | v2 | ✅ Complete: 20 detections, strict POPIA. Professional names pending final decision. |
| 3 | Financial (bank statement) | | ❌ Need test document |
| 4 | HR/employment (CV or contract) | | ❌ Need test document |
| 5 | Identity document (ID/passport) | | ❌ Need test document |
| 6 | General (letter/email) | | ❌ Need test document |

Priority order for next ground truths:

  1. Financial — highest demand from compliance officers, clear PI patterns, tests merchant/institutional handling
  2. HR/employment — common use case, tests employment equity data (Section 26 special PI)
  3. Identity document — straightforward (almost everything is PI) but tests image-based/DocuPipe path
  4. General — catch-all, lowest priority, most variable

For each new document:

  1. Source a test document (real preferred, synthetic acceptable)
  2. Apply Part 3 guidelines to classify every item
  3. Build ground truth JSON following Part 4 structure
  4. Run regression suite — must pass ALL existing ground truths
  5. Add domain-specific classification prompt (implementation plan Steps 8-10)

Changelog

| Date | Change | Rationale |
| --- | --- | --- |
| 28 Feb 2026 | Initial framework created | Standardise ground truth creation across document types |
| 28 Feb 2026 | Institutional letterhead excluded from medical detections | Not patient PI per HIPAA Safe Harbor, POPIA de-identification purpose test |
| 28 Feb 2026 | Professional names set at 0.50 confidence, secondary relevance | POPIA Section 1(h) applies but HIPAA explicitly excludes workforce names from de-identification requirement |