How SureDox De-identifies Medical Documents: A POPIA-Grounded Methodology

2 March 2026 | SureDox | 22 min read | Medical De-identification

Introduction

Medical documents present the most complex challenge in document redaction. A single genetic test report might contain the patient's name alongside hundreds of gene names, disease descriptions, laboratory details, and scientific references — all mixed together across multiple pages. Redacting too little exposes the patient. Redacting too much destroys the document's clinical and research value.

SureDox approaches this challenge with a structured methodology grounded in South African law, informed by international best practice where our law is silent. This document explains exactly how we determine what to redact on a medical document and why.


1. The Starting Point: Who Is the Data Subject?

Every de-identification exercise begins with a single question: whose personal information are we protecting?

On a medical document — whether it is a laboratory report, a discharge summary, a pathology result, or a prescription — the answer is almost always the patient. The patient is the data subject. The entire purpose of de-identifying a medical document is to remove information that identifies the patient, so the document can be used for secondary purposes such as research, training, quality assurance, peer review, or regulatory compliance — without exposing who the patient is.

This assumption shapes every decision that follows. When we encounter an item on a medical document — a name, an address, a date, a phone number — the question is not simply "is this personal information?" but rather "is this personal information of the patient, and does it contribute to identifying the patient?"

This distinction matters enormously. A laboratory's street address is personal information of the laboratory (a juristic person). But it tells you nothing about who the patient is. Redacting it achieves no privacy benefit for the patient while degrading the document's utility.


2. The Legal Foundation: POPIA's Definition of De-identification

South Africa's Protection of Personal Information Act 4 of 2013 (POPIA) provides the legal framework. Section 1 defines de-identification as follows:

"de-identify", in relation to personal information of a data subject, means to delete any information that — (a) identifies the data subject; (b) can be used or manipulated by a reasonably foreseeable method to identify the data subject; or (c) can be linked by a reasonably foreseeable method to other information that identifies the data subject

Three elements of this definition are critical for medical documents:

"identifies the data subject" — Direct identifiers like the patient's name, ID number, or medical record number. These identify the patient immediately and unambiguously. They must always be redacted.

"by a reasonably foreseeable method" — This is an objective legal standard. It asks whether a typical person with ordinary skills and access to publicly available information could use this data point to identify the patient. It does not require protection against every theoretically possible attack — only against reasonably foreseeable ones.

"linked ... to other information that identifies the data subject" — This captures quasi-identifiers: data points that do not identify the patient alone but could do so in combination. A date of birth by itself might not identify someone, but date of birth combined with gender, race, and a rare diagnosis could narrow the candidate pool to a single individual.

Swales (2021) explains the practical effect: "the typical researcher with the usual skills, expertise and knowledge of someone working in that field, should not be able to identify a data subject in the data set." This is not a theoretical exercise — it is a practical, context-dependent assessment.

Section 6: When POPIA No Longer Applies

Section 6 of POPIA provides that the Act does not apply to data "de-identified to the extent that it cannot be re-identified again." This sets a high bar — the de-identification must be effectively irreversible. For medical documents shared externally, this means removing sufficient identifiers that even someone with access to public databases and common investigative tools could not link the document back to the patient.

Section 26: Special Personal Information in Medical Context

POPIA Section 26 prohibits the processing of special personal information, a category that includes a data subject's health or sex life and biometric information, except under specific authorisations set out in Sections 27 and 32. POPIA does not list genetic data by name, but its definition of biometrics names DNA analysis, and a genetic finding about a patient is information about that patient's health. Section 32 is particularly relevant for medical documents: it permits processing of health data by medical professionals, healthcare institutions, and facilities where necessary for the proper treatment and care of the data subject, or where processing is necessary for reasons of public interest in the field of public health.

For medical document de-identification, Section 26 is significant because it confirms that clinical findings, diagnoses, genetic variants, and health conditions that are specific to the patient are among the most sensitive categories of personal information. These are not just identifiers — they are the kind of information that POPIA considers most deserving of protection.

The Information Regulator's own Guidance Note on Processing of Special Personal Information (June 2021) confirms these categories and sets out the authorisation framework under Section 27(2) for responsible parties who need to process special PI. The guidance note requires that appropriate safeguards — including reasonable technical and organisational measures under Section 19 — be in place to protect this data. De-identification is one such safeguard: by removing the link between health data and the patient's identity, the data can be used for legitimate secondary purposes without exposing the patient.

However, it is important to distinguish between information about the patient's health and general medical knowledge that happens to appear on the same document. A genetic test report might contain both the patient's specific pathogenic variant (patient health data — Section 26) and a paragraph describing the general features of dilated cardiomyopathy (medical knowledge — not personal information of anyone). The former must be redacted; the latter need not be.


3. Where POPIA Is Silent: The Case for HIPAA Safe Harbor

POPIA defines what de-identification means. It does not prescribe how to achieve it.

As Swales (2021) observes: "In a South African context, other than the definition, POPIA does not contain any specific provision that deals with data de-identification directly. The term is mentioned in three sections of the Act (section 1, section 6, and section 14), but there is no specific guidance on how to achieve data de-identification, or any other detail in relation thereto."

The Information Regulator has not published de-identification guidance as of early 2026. The Regulator has issued guidance notes on processing special personal information (June 2021), processing children's information (June 2021), and processing personal information during COVID-19 (April 2020). The COVID-19 guidance note acknowledges de-identification as a legitimate data management tool — stating that a responsible party must "destroy or delete a record of personal information or de-identify it as soon as reasonably practicable" once no longer authorised to retain it. Yet none of these guidance notes address the practical question: how should an organisation de-identify health data? What specific items must be removed? The methodology gap remains unfilled.

The ASSAf POPIA Compliance Framework for Research, while valuable, provides general principles rather than specific rules for medical document de-identification.

Swales recommends a practical solution: "I suggest that a good rule of thumb to achieve de-identification in South Africa would be to ensure that the 18 identifiers of an individual set out in HIPAA are removed from the data set."

SureDox follows this recommendation. In the absence of POPIA-specific de-identification guidance, we adopt the HIPAA Safe Harbor method as our operational framework, adapted for the South African context and supplemented by the UK Information Commissioner's Office anonymisation guidance where appropriate.

The HIPAA Safe Harbor Method

The United States Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule at §164.514(b) provides two methods for de-identifying health information: Expert Determination and Safe Harbor. The Safe Harbor method requires removal of 18 specific identifiers, adapted here for the South African medical context:

| # | Identifier | SA equivalent / example |
|---|---|---|
| 1 | Names | Patient name, maiden name, aliases |
| 2 | Geographic data smaller than province | Street address, city, suburb, postal code (province-level may remain) |
| 3 | All date elements except year | Date of birth (except year), admission date, discharge date, specimen date, report date. Ages over 89 aggregated to "90+" |
| 4 | Telephone numbers | Patient's phone numbers |
| 5 | Fax numbers | Patient's fax numbers |
| 6 | Email addresses | Patient's email |
| 7 | Identity numbers | SA ID number, passport number |
| 8 | Medical record numbers | Hospital/clinic MRN, folder number |
| 9 | Health plan beneficiary numbers | Medical aid membership number, scheme number |
| 10 | Account numbers | Hospital account number, billing reference |
| 11 | Certificate/licence numbers | Driver's licence, professional registration (when identifying the patient) |
| 12 | Vehicle identifiers | Not typically on medical documents |
| 13 | Device identifiers/serial numbers | Implant serial numbers, pacemaker IDs |
| 14 | URLs | Patient's personal web identifiers |
| 15 | IP addresses | If present in electronic records |
| 16 | Biometric identifiers | Fingerprints, retinal scans |
| 17 | Full-face photographs | Patient photos |
| 18 | Any other unique identifying number | Specimen IDs, accession numbers, family numbers, any code that serves as a database key to the patient's record |

Critical Scoping: Identifiers of Whom?

A point that is often overlooked — and one that is essential for medical documents — is the scope of these 18 identifiers. The HIPAA Safe Harbor rule requires removal of identifiers "of the individual or of relatives, employers, or household members of the individual."

The "individual" is the patient. The rule does not require removal of identifiers belonging to the healthcare institution, the laboratory, the referring physician, or other professionals involved in the patient's care.

The US Department of Health and Human Services (HHS) has confirmed this explicitly. In its official Guidance on De-identification of Protected Health Information (November 2012), FAQ 3.8 states that there is no requirement to remove the names of healthcare providers or any workforce members from the dataset.

This means:

  • The laboratory name on a report header does not need to be removed
  • The laboratory address and phone number do not need to be removed
  • The referring physician's name does not need to be removed
  • The pathologist or report author's name does not need to be removed
  • The referring institution's name does not need to be removed

These are identifiers of the institution and its workforce — not of the patient. They appear identically on every report the institution produces, regardless of which patient the report concerns. Removing them achieves no meaningful de-identification benefit while significantly reducing the document's utility.
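
The scoping rule above can be pictured as a filter on who an identifier belongs to. The following is a minimal sketch, not SureDox's implementation: the `Owner` enum, the `Identifier` record, and the sample values are all hypothetical names introduced for illustration, and it assumes the owner of each item has already been determined by some earlier step.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    PATIENT = "patient"
    RELATIVE = "relative"
    EMPLOYER = "employer"
    HOUSEHOLD_MEMBER = "household_member"
    PROVIDER = "provider"        # physician, pathologist, other workforce
    INSTITUTION = "institution"  # laboratory, hospital, clinic


# Owners named in the Safe Harbor scope quoted above.
IN_SCOPE = {Owner.PATIENT, Owner.RELATIVE, Owner.EMPLOYER, Owner.HOUSEHOLD_MEMBER}


@dataclass(frozen=True)
class Identifier:
    kind: str   # e.g. "name", "phone", "address"
    value: str
    owner: Owner


def requires_removal(item: Identifier) -> bool:
    """True when the identifier belongs to someone inside the Safe Harbor scope."""
    return item.owner in IN_SCOPE


# Illustrative, made-up values.
items = [
    Identifier("name", "Jane Doe", Owner.PATIENT),
    Identifier("name", "Dr. Smith", Owner.PROVIDER),
    Identifier("phone", "021 555 0100", Owner.INSTITUTION),
    Identifier("name", "John Doe", Owner.RELATIVE),
]
to_redact = [item.value for item in items if requires_removal(item)]
print(to_redact)  # ['Jane Doe', 'John Doe']
```

The hard part in practice is assigning the owner, which this sketch takes as given; the filter itself only encodes the rule that the same kind of data (a name, a phone number) is treated differently depending on whose it is.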


4. The UK ICO "Motivated Intruder" Test

The UK Information Commissioner's Office (ICO) anonymisation guidance introduces the "motivated intruder" test, which Swales (2021) recommends for borderline cases under POPIA.

The test asks: could a reasonably competent person, with access to all publicly available information and standard investigative techniques (but without specialist hacking skills or criminal methods), re-identify the patient from the remaining information?

This test is particularly useful for assessing items that fall between clear identifiers and clear non-identifiers. For example:

A referring physician's name on a rare disease report: If Dr. Smith is one of only three specialists in South Africa who treats a particular rare condition, and the de-identified report shows Dr. Smith ordered testing for that condition in January 2024, a motivated intruder who knows Dr. Smith's patient list could potentially narrow down the patient. In this case, the physician name creates a re-identification pathway.

A report author's name at a large laboratory: If Dr. Jones signed 500 reports in January 2024 at a national reference laboratory, knowing Dr. Jones was the signing pathologist tells you essentially nothing about which patient this report concerns. The re-identification risk is negligible.

SureDox applies the motivated intruder test as a secondary check on borderline items — particularly professional names, institutional details, and quasi-identifiers that could theoretically contribute to re-identification in specific contexts.
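
One way to make the two examples above concrete is to compare the number of patients who could plausibly match the visible context against a cut-off. This is only a sketch under stated assumptions: the function name, the idea of a single numeric threshold, and the default value of 20 are hypothetical and do not come from the ICO guidance, POPIA, or SureDox.

```python
def estimated_pool_is_small(matching_patients: int, threshold: int = 20) -> bool:
    """Return True when so few patients fit the visible context that a
    professional's name could help single one of them out.

    `threshold` is an illustrative cut-off, not a legal standard.
    """
    if matching_patients < 0 or threshold < 1:
        raise ValueError("matching_patients must be >= 0 and threshold >= 1")
    return matching_patients < threshold


# Dr. Jones example from the text: 500 reports signed in the month.
print(estimated_pool_is_small(500))  # False

# Hypothetical figure for a rare-disease specialist; the text gives no number.
print(estimated_pool_is_small(4))    # True
```

A real assessment weighs context that a single number cannot capture, so a check like this would at most route an item to human review.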


5. Applying the Framework: Medical Identifier Categories

With the legal foundation established, we can now classify every type of information that appears on a medical document into clear categories. For each category, we explain what it is, why it does or does not identify the patient, and what SureDox does with it.

5.1 Direct Patient Identifiers — Always Redact

These items identify the patient immediately. They correspond to HIPAA Safe Harbor identifiers 1-18 as scoped to the patient.

Patient name — The most fundamental identifier. Appears in headers, salutations, result summaries. Under POPIA Section 1(h), "the name of the person" is explicitly listed as personal information. Must be redacted wherever it appears.

Date of birth — Under POPIA Section 1(a), "birth" is explicitly listed. DOB combined with gender and any geographic indicator can uniquely identify most individuals. Golle's 2006 re-analysis of US census data estimated that about 63% of Americans are uniquely identifiable from just gender, date of birth, and 5-digit ZIP code (Sweeney's earlier estimate was 87%). No equivalent South African figure is cited here, but the same narrowing logic applies.

Identity number / passport number — POPIA Section 1(c): "any identifying number, symbol, e-mail address, physical address, telephone number, location information, online identifier or other particular assignment to the person." The SA ID number encodes date of birth, gender, and citizenship — it is a master identifier.

Medical record number (MRN) — The hospital or clinic's database key for this patient. Anyone with access to the healthcare system can enter this number and retrieve the patient's entire medical history. It functions as a direct lookup key.

Accession number / specimen ID / family number — Laboratory system identifiers that serve the same function as an MRN within the lab's information system. The accession number links to who ordered the test, which patient it was for, what the results were, and when it was processed. These are database keys traceable to the patient through the lab's chain-of-custody records.

Medical aid / health plan number — Identifies the patient within the medical aid scheme's system. Equivalent to HIPAA's "health plan beneficiary number."

Account / billing numbers — Hospital or lab account numbers linked to the patient.
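
To illustrate why the SA ID number is described above as a master identifier, the sketch below decodes one. It assumes the commonly documented 13-digit layout (YYMMDD for date of birth, a four-digit sequence where 5000 and above indicates male, a citizenship digit, one further digit, and a Luhn check digit); that layout is not spelled out in this article. The function names are hypothetical and the sample number is synthetic.

```python
from datetime import date
from typing import Optional


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def decode_sa_id(id_number: str, today: Optional[date] = None) -> dict:
    """Extract the attributes that a 13-digit SA ID number reveals."""
    if len(id_number) != 13 or not (id_number.isascii() and id_number.isdigit()):
        raise ValueError("expected exactly 13 digits")
    if not luhn_valid(id_number):
        raise ValueError("checksum digit does not match")
    today = today or date.today()
    yy, mm, dd = int(id_number[0:2]), int(id_number[2:4]), int(id_number[4:6])
    # Only two year digits are stored, so the century is a guess:
    # use 20YY unless that year would be in the future.
    year = 2000 + yy if 2000 + yy <= today.year else 1900 + yy
    citizenship = {"0": "SA citizen", "1": "permanent resident"}.get(id_number[10], "other")
    return {
        "date_of_birth": date(year, mm, dd),
        "sex": "male" if int(id_number[6:10]) >= 5000 else "female",
        "citizenship": citizenship,
    }


# Synthetic number that passes the checksum; it is not a real person's ID.
decoded = decode_sa_id("8001015009087", today=date(2026, 3, 2))
print(decoded)
```

A number left unredacted therefore leaks three quasi-identifiers at once, even before anyone looks it up in a database.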

5.2 Quasi-Identifiers — Redact (Narrowing Risk)

These items do not identify the patient alone but significantly narrow the candidate pool, especially in combination.

Gender / sex — POPIA Section 1(a) explicitly lists "sex." On its own, reduces the population by roughly half. Combined with other quasi-identifiers on the same document, contributes materially to re-identification.

Race / ethnicity — POPIA Section 26 classifies race as special personal information. Beyond the narrowing effect, its presence on a medical document warrants heightened protection.

Age (when over 89) — Following HIPAA guidance, ages over 89 should be aggregated to "90 or older" because very elderly patients are more readily identifiable due to smaller population sizes at extreme ages.

Dates of service — Specimen collection date, report date, admission date, discharge date. These are "elements of dates directly related to an individual" under HIPAA Safe Harbor identifier #3. A specimen received on 24 January 2014 at a specific laboratory, combined with the diagnosis and referring institution, narrows the patient to a very small pool.
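
The Safe Harbor date and age rules referred to above can be sketched as two small generalisation helpers. This is an illustrative sketch only: the function names are hypothetical, and it shows the Safe Harbor minimum (keep the year, cap ages at 90+) rather than claiming how SureDox renders a redacted date.

```python
from datetime import date


def generalise_date(value: date) -> str:
    """Keep only the year, dropping day and month."""
    return str(value.year)


def generalise_age(age: int) -> str:
    """Report ages over 89 as a single '90+' band."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return "90+" if age > 89 else str(age)


print(generalise_date(date(2014, 1, 24)))  # 2014
print(generalise_age(89))                  # 89
print(generalise_age(93))                  # 90+
```

Safe Harbor also requires removing a year when it would reveal an age over 89, so a birth year should not be retained for patients in the 90+ band.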

5.3 Patient-Specific Clinical Findings — Redact (Section 26 Health/Genetic Data)

These are the patient's actual medical results — the clinical content that is specific to this individual patient, as opposed to general medical knowledge.

Clinical indication / reason for testing — Statements like "Clinical diagnosis and family history of dilated cardiomyopathy with arrhythmia." This is the patient's medical history — Section 1(a) "physical or mental health" and Section 26 health data. It is not a description of a disease in general; it is a description of what this patient has.

Patient-specific genetic variants — Coding DNA changes (e.g., c.244G>A), protein changes (e.g., p.Glu82Lys), and genomic coordinates (e.g., Chr11:g.47364249G>A) found in the patient's results section. This methodology treats them as special personal information under POPIA Section 26. Because POPIA does not define "genetic" data separately, we borrow the GDPR Article 4(13) definition for specificity: "personal data relating to the inherited or acquired genetic characteristics of a natural person which give unique information about the physiology or the health of that natural person."

A patient's specific pathogenic variant is analogous to a molecular fingerprint. It describes a change in that patient's DNA that may be shared by only a handful of individuals worldwide. Combined with the disease, demographic information, and institutional context, it can be highly identifying.

Carrier status findings — Incidental findings indicating the patient carries a variant predisposing to another condition (e.g., MUTYH-associated polyposis carrier). This has implications not only for the patient but for their biological relatives — carrier status is inheritable, which is precisely why genetic data receives heightened protection.
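
The three variant notations given as examples above have a regular shape, so they can be located with pattern matching. The sketch below is a simplified illustration and not SureDox's detector: the patterns cover only single-nucleotide substitutions and three-letter amino acid changes like the examples in the text, whereas real variant nomenclature also includes deletions, insertions, duplications, and more. The names `VARIANT_PATTERNS`, `find_variants`, and `redact_variants` are hypothetical.

```python
import re

# Simplified patterns modelled on the examples in the text.
VARIANT_PATTERNS = {
    "genomic": re.compile(r"\b[Cc]hr(?:\d{1,2}|[XYM]):g\.\d+[ACGT]>[ACGT]\b"),
    "coding": re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b"),
    "protein": re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b"),
}


def find_variants(text: str) -> list:
    """Return (kind, matched_text) pairs in order of appearance."""
    hits = []
    for kind, pattern in VARIANT_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group()))
    return [(kind, value) for _, kind, value in sorted(hits)]


def redact_variants(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every matched variant with a placeholder."""
    for pattern in VARIANT_PATTERNS.values():
        text = pattern.sub(placeholder, text)
    return text


sample = "Variant c.244G>A (p.Glu82Lys) at Chr11:g.47364249G>A was detected."
print(find_variants(sample))
print(redact_variants(sample))
```

Pattern matching alone cannot tell a patient's result from a variant mentioned in a general information paragraph, so a detector like this would need to be limited to the results section or paired with a check on where the match appears.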

5.4 General Medical Knowledge — Do Not Redact

This is information that appears on the medical document but is not about the patient. It is medical knowledge, scientific methodology, or reference material that would be identical regardless of which patient the report concerns.

Disease descriptions in information sections — Paragraphs describing the general features, prevalence, inheritance pattern, and clinical management of a condition. "Dilated cardiomyopathy is characterised by ventricular dilation and impaired systolic function" describes the disease, not the patient. This text is identical on every report for every patient with the same condition. Redacting it removes valuable clinical context without any privacy benefit.

Gene names in coverage/panel tables — A genetic test report typically includes a table listing the genes tested and their coverage metrics. These gene names (e.g., BRAF, CALM1, AARS2) are the laboratory's test menu — they describe what the lab looked for, not what the lab found in this patient. They are scientific references, not personal information of anyone. Redacting a 335-gene coverage table would obliterate the report's methodological transparency.

Methodology terms — Reference genome builds (GRCh37, GRCh38), sequencing platforms (Illumina), analysis pipelines (GATK), specimen types ("Blood, Peripheral"), test names ("Whole genome sequencing"). These describe how the lab processed the specimen. They are laboratory procedures, not patient data.

Transcript and database identifiers — RefSeq transcript IDs (NM_001122335), OMIM numbers, ClinVar references. These are identifiers within public scientific databases. They identify genes or diseases in the global scientific literature, not patients.

Literature citations — Author names and references in bibliography sections. These are published, publicly available works, and the authors are not the patient.

5.5 Institutional and Professional Information — Do Not Redact

This is information about the healthcare providers and institutions involved in the patient's care. It is personal information — but of the institution or professional, not of the patient.

Laboratory / hospital name, address, phone, fax, URL — Institutional letterhead that appears identically on every report this institution produces. As discussed in Section 3 above, the Safe Harbor scope covers identifiers of the patient and of the patient's relatives, employers, and household members, so institutional details fall outside it. The same letterhead on a de-identified report tells you which lab processed the test; it does not tell you who the patient was.

Referring institution name — The hospital or clinic that referred the patient. While knowing the referring institution narrows the geographic pool, HIPAA does not list referring facility as one of the 18 identifiers. The re-identification risk through institutional name alone is low — a major hospital refers thousands of patients for testing.

Laboratory certification numbers (CLIA, CAP) — Publicly available regulatory identifiers of the laboratory. These are accessible through public registries and do not identify any patient.

Physician and laboratory staff names — The referring physician, report authors, and signing pathologists are natural persons, and their names are their own personal information under POPIA Section 1(h), not personal information of the patient. HHS FAQ 3.8 confirms there is no requirement to remove physician or workforce names for de-identification purposes. The exception is the borderline case described in Section 4, where a professional's name narrows the candidate pool enough to create a re-identification pathway.

SureDox acknowledges these as personal information of the professionals concerned but does not flag them as required redactions for patient de-identification. Users who wish to redact professional names for other reasons (e.g., the professional's own privacy preference) may do so at their discretion.


6. The Re-identification Risk Assessment

After applying the categories above, the final step is a holistic re-identification risk assessment. Even after removing all direct identifiers, could the remaining information — taken together — allow a motivated intruder to identify the patient?

Key factors to consider:

Rarity of condition — A common condition (e.g., hypertension) tested at a large hospital provides minimal re-identification risk from clinical content alone. A rare genetic condition tested at a specialist centre creates higher risk from clinical content — fewer patients have this condition, so each remaining data point narrows the pool more sharply.

Combination of quasi-identifiers — Gender + race + age range + rare diagnosis + referring institution + approximate date of service. Each item alone may be innocuous, but the combination can be identifying. This is why we redact quasi-identifiers (gender, race, dates) even though they seem general — it is the combination that creates risk.

Small population sizes — Reports involving rare diseases, very elderly patients, or small geographic areas require more aggressive de-identification because the baseline population is smaller.

Publicly available data — If patient registries, published case reports, or social media posts exist for a rare condition, the de-identified report's clinical details could potentially be matched against these public sources.

SureDox's approach is conservative: we redact all direct identifiers, all quasi-identifiers, and all patient-specific clinical findings. We preserve general medical knowledge, methodology, and institutional information. This balance ensures meaningful de-identification while maintaining the document's clinical and research utility.
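
The "combination of quasi-identifiers" factor above can be checked numerically when several de-identified records are released together. The sketch below uses a k-anonymity-style group count, a technique this article does not itself name, so treat it as an illustrative addition. The field names, the sample records, and the choice of k are all assumptions.

```python
from collections import Counter

# Hypothetical quasi-identifier fields left on each record.
QUASI_IDENTIFIERS = ("sex", "age_band", "diagnosis", "referring_institution")


def group_key(record: dict, keys=QUASI_IDENTIFIERS) -> tuple:
    """The combination of quasi-identifier values for one record."""
    return tuple(record[key] for key in keys)


def records_below_k(records: list, k: int, keys=QUASI_IDENTIFIERS) -> list:
    """Return records whose quasi-identifier combination is shared by fewer than k records."""
    sizes = Counter(group_key(record, keys) for record in records)
    return [record for record in records if sizes[group_key(record, keys)] < k]


common = {"sex": "F", "age_band": "40-49", "diagnosis": "hypertension",
          "referring_institution": "Hospital A"}
records = [
    {"id": 1, **common},
    {"id": 2, **common},
    {"id": 3, **common},
    {"id": 4, "sex": "M", "age_band": "30-39",
     "diagnosis": "dilated cardiomyopathy", "referring_institution": "Clinic B"},
]

flagged = records_below_k(records, k=2)
print([record["id"] for record in flagged])  # [4]
```

Record 4 is unique on its combination even though no single field identifies anyone, which is the situation the rarity and small-population factors describe.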


7. Summary: What SureDox Redacts and What It Preserves

| Category | Action | Rationale |
|---|---|---|
| Patient name, ID number, MRN | Redact | Direct identifiers: POPIA Section 1(c), 1(h) |
| Date of birth, service dates | Redact | Quasi-identifiers: POPIA Section 1(a), 1(c); HIPAA Safe Harbor #3 |
| Specimen ID, accession number, family number | Redact | Database keys traceable to patient: POPIA Section 1(c); HIPAA Safe Harbor #18 |
| Gender, race, ethnicity | Redact | Quasi-identifiers: POPIA Section 1(a), Section 26 |
| Patient-specific diagnosis/clinical indication | Redact | Patient health data: POPIA Section 1(a), Section 26 |
| Patient-specific genetic variants and coordinates | Redact | Genetic data: POPIA Section 26 |
| Medical aid / account numbers | Redact | Patient identifiers: POPIA Section 1(c); HIPAA Safe Harbor #9, #10 |
| Disease descriptions (general) | Preserve | General medical knowledge, not patient-specific |
| Gene coverage tables | Preserve | Laboratory methodology, not patient data |
| Methodology terms, reference genomes | Preserve | Scientific standards |
| Transcript IDs, database references | Preserve | Public scientific references |
| Laboratory name, address, phone, URL | Preserve | Institutional letterhead, not patient PI; HIPAA confirms no removal required |
| Referring institution name | Preserve | Institutional reference, not one of HIPAA's 18 identifiers |
| Physician / lab staff names | Preserve | PI of the professional, not the patient; HIPAA HHS FAQ 3.8 |
| Literature citations | Preserve | Published public domain works |
| Certification numbers (CLIA, CAP) | Preserve | Publicly available regulatory IDs |
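
The summary above amounts to a lookup from category to action, which could be expressed as data. The sketch below is one possible encoding, not SureDox's configuration: the category keys, the `Action` enum, and the choice to raise an error on unknown categories (so that a person reviews them) are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    REDACT = "redact"
    PRESERVE = "preserve"


# Hypothetical category keys mirroring the rows of the summary.
POLICY = {
    "patient_name": Action.REDACT,
    "id_number": Action.REDACT,
    "medical_record_number": Action.REDACT,
    "date_of_birth": Action.REDACT,
    "service_date": Action.REDACT,
    "specimen_or_accession_id": Action.REDACT,
    "gender": Action.REDACT,
    "race_or_ethnicity": Action.REDACT,
    "clinical_indication": Action.REDACT,
    "patient_genetic_variant": Action.REDACT,
    "medical_aid_or_account_number": Action.REDACT,
    "general_disease_description": Action.PRESERVE,
    "gene_coverage_table": Action.PRESERVE,
    "methodology_term": Action.PRESERVE,
    "transcript_or_database_id": Action.PRESERVE,
    "institution_letterhead": Action.PRESERVE,
    "referring_institution": Action.PRESERVE,
    "professional_name": Action.PRESERVE,
    "literature_citation": Action.PRESERVE,
    "lab_certification_number": Action.PRESERVE,
}


def decide(category: str) -> Action:
    """Look up the action for a category; unknown categories are not guessed."""
    try:
        return POLICY[category]
    except KeyError:
        raise KeyError(f"no policy for category {category!r}; needs manual review") from None


print(decide("patient_name").value)         # redact
print(decide("gene_coverage_table").value)  # preserve
```

Keeping the policy as data rather than scattered conditionals makes it easier to audit against the legal rationale for each row.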

8. Legal References and Authority

Primary authority:

  • Protection of Personal Information Act 4 of 2013 (POPIA), Sections 1, 6, 14, 19, 26, 27, 32
  • Constitution of the Republic of South Africa, 1996, Section 14 (right to privacy)

Secondary authority (consulted where POPIA is silent on de-identification methodology):

  • HIPAA Privacy Rule, §164.514(a)-(c) — Safe Harbor method and 18 identifiers
  • HHS Guidance on De-identification of Protected Health Information (November 2012), including FAQ 3.8 on physician/workforce names
  • UK ICO Anonymisation Guidance (March 2025) — motivated intruder test, spectrum of identifiability

Information Regulator Guidance Notes:

  • Guidance Note on Processing of Special Personal Information (June 2021) — Confirms Section 26 categories; sets out Section 27(2) authorisation framework; requires Section 19 safeguards. Does not address de-identification methodology.
  • Guidance Note on Processing of Personal Information in the Management and Containment of COVID-19 (April 2020) — Acknowledges de-identification as a legitimate processing activity under POPIA; requires de-identification or destruction when retention is no longer authorised. Does not address de-identification methodology.
  • Guidance Note on Processing of Personal Information of Children (June 2021) — Reproduces de-identification definition; does not address methodology.

Academic authority:

  • Swales L. "The Protection of Personal Information Act and data de-identification." South African Journal of Science, 117(7/8), 2021. — Recommends HIPAA §164.514(b) and UK ICO guidance as interim standards where POPIA is silent.
  • Thaldar DW. "Does data protection law in South Africa apply to pseudonymised data?" PMC, 2023. — Analyses the "reasonably foreseeable method" standard and context-specific identifiability under POPIA.
  • ASSAf POPIA Compliance Framework for Research (2024) — Voluntary guidelines for research data handling.

Case law:

  • Divine Inspiration Trading 205 (Pty) Ltd v Gordon and Others (22455/2019) [2021] ZAWCHC 38 — Confirmed medical records constitute personal information under POPIA Section 1; held that POPIA yields to court process under Sections 12(2)(d)(iii) and 15(3)(c)(iii). (Note: this case addresses disclosure in litigation, not de-identification methodology, but confirms the classification of medical records as PI.)

This methodology is maintained by SureDox and reflects the state of POPIA guidance as of February 2026. SureDox will update this document as the Information Regulator publishes de-identification guidance or as relevant case law develops.

SureDox uses Anthropic's Claude AI for document analysis. All document processing occurs under strict security controls with no data retention by the AI provider. For details on our security practices, see our Security page.