Anonymizing protected health information

PHI is anonymized when patient data needs to be shared, analyzed, or used without compromising individual privacy, such as in research, public health reporting, regulatory compliance, and data sharing with third parties. The anonymization process enables organizations to leverage valuable healthcare data to improve patient care, advance medical research, and develop new healthcare products while ensuring patient identities remain protected.

HIPAA compliance and anonymizing PHI

To make health data usable for research, analytics, or public health while protecting privacy, HIPAA allows certain PHI to be transformed so it is no longer considered identifiable. According to HHS guidance, de-identified data is information that “does not identify an individual and there is no reasonable basis to believe that the information can be used to identify an individual.”

HIPAA’s de-identification framework is meant to balance privacy protection with data use. As HHS explains, the goal is to “protect individual privacy while allowing important uses of health information.” By requiring identifiers to be removed or re-identification risks to be minimized, HIPAA enables the responsible secondary use of health information without compromising individual privacy.

Once PHI has been de-identified under HIPAA’s standards, it is no longer subject to the Privacy Rule’s protections, meaning it can be used and disclosed without the restrictions that apply to identifiable information.

Two approaches to de-identifying PHI

HIPAA recognizes two methods for de-identifying PHI:

Safe harbor method

In the Safe Harbor method, “the following identifiers of the individual or of relatives, employers, or household members of the individual, are removed.” These include:

Names
All geographic subdivisions smaller than a state (e.g., street addresses, city, county)
All elements of dates (except year) related to an individual (e.g., birthdate, admission date)
Telephone numbers
Email addresses
Social Security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/license numbers
Vehicle identifiers and serial numbers, including license plate numbers
Device identifiers and serial numbers
URLs
IP addresses
Biometric identifiers (e.g., fingerprints)
Full-face photographs and any comparable images
Any other unique identifying number, characteristic, or code

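Once these identifiers are removed, and provided the covered entity has no actual knowledge that the remaining information could be used to identify an individual, the data is considered de-identified under HIPAA, meaning it is no longer subject to the regulations that apply to PHI.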
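For structured records, the removal step can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the field names (`name`, `birth_date`, and so on) are hypothetical, the field list does not cover every identifier category, and free-text fields that might contain identifiers are not handled at all.

```python
from datetime import date

# Hypothetical field names; a real schema will differ and needs its own review.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "street_address", "city", "county", "phone", "email",
    "ssn", "medical_record_number", "health_plan_id", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
}
DATE_FIELDS = {"birth_date", "admission_date"}


def safe_harbor_strip(record: dict) -> dict:
    """Drop listed identifier fields and reduce listed dates to the year only."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue  # remove the identifier entirely
        if field in DATE_FIELDS and isinstance(value, date):
            # Keep only the year, e.g. birth_date -> birth_year
            cleaned[field.replace("_date", "_year")] = value.year
            continue
        cleaned[field] = value
    return cleaned


record = {
    "name": "Jane Example",
    "city": "Springfield",
    "state": "IL",
    "birth_date": date(1980, 5, 17),
    "diagnosis": "type 2 diabetes",
}
print(safe_harbor_strip(record))
# {'state': 'IL', 'birth_year': 1980, 'diagnosis': 'type 2 diabetes'}
```

The sketch uses a blocklist of fields; an allowlist of fields known to be safe is a more conservative design, because a newly added column is then dropped by default.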

Expert determination method

Under the expert determination method, a covered entity may de-identify PHI by relying on a qualified expert with “appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods.” According to HIPAA, this expert must apply those principles to determine that the “risk is very small” that the information could be used, either alone or in combination with other reasonably available data, to identify an individual.

Unlike the Safe Harbor method, this approach does not require the automatic removal of a fixed list of identifiers. Instead, it allows certain data elements to be retained if the expert determines they do not meaningfully increase the risk of re-identification. Moreover, HIPAA requires that the expert document both the methods used and the results of the analysis to justify the determination that the data is not individually identifiable.

This flexibility makes the expert determination method particularly valuable for research, analytics, and public health use cases where preserving data granularity is essential, provided that re-identification risks remain sufficiently low.
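The text above does not name a specific risk metric, and the choice belongs to the expert. Purely as an illustration, one measure often used in this kind of analysis is the size of the smallest group of records that share the same quasi-identifier values (the idea behind k-anonymity). The sketch below assumes hypothetical field names and toy data.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Toy data with hypothetical fields.
records = [
    {"age_band": "30-39", "state": "CA", "diagnosis": "asthma"},
    {"age_band": "30-39", "state": "CA", "diagnosis": "flu"},
    {"age_band": "40-49", "state": "CA", "diagnosis": "asthma"},
    {"age_band": "40-49", "state": "CA", "diagnosis": "diabetes"},
    {"age_band": "40-49", "state": "NV", "diagnosis": "flu"},
]

k = smallest_group_size(records, ["age_band", "state"])
print(k)  # 1: the single (40-49, NV) record is unique on these attributes
```

A result of 1 means at least one record is unique on the chosen attributes, which is the kind of finding that might lead to further generalization or suppression before release.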

Techniques for anonymizing PHI

To effectively anonymize PHI, a variety of data masking techniques can be used. The study “Anonymizing and Sharing Medical Text Records” identifies some of these techniques:

Automatic information extraction

The study describes how modern systems focus on extracting identifiers such as names and dates, noting that “information extraction… extracts PHI and non-PHI attributes that could reveal patients’ identities, as well as health and medical information, from the text documents.”

This extraction is done through techniques such as:

Pattern matching, which uses rules or dictionaries to find common PHI elements, and
Machine learning classifiers, which can learn patterns to identify PHI even in varied and complex text.
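The pattern-matching half of this can be sketched with Python's standard `re` module. The patterns below are simplified assumptions that cover a few US-style formats only; they are not taken from the study, and they will not catch names, which is where dictionaries or learned classifiers come in.

```python
import re

# Simplified, illustrative patterns; real notes need broader coverage.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def find_phi(text):
    """Return (start, end, label) spans for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text):
    """Replace each detected span with a bracketed label such as [PHONE]."""
    pieces = []
    cursor = 0
    for start, end, label in find_phi(text):
        if start < cursor:
            continue  # skip a span that overlaps one already redacted
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Pt seen 03/14/2021, call 555-123-4567 or jdoe@example.org. SSN 123-45-6789."
print(redact(note))
# Pt seen [DATE], call [PHONE] or [EMAIL]. SSN [SSN].
```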

Clustering related records

In the study, the authors describe a method where documents are organized using recursive binary partitioning, a technique that clusters records by similarity of medical concepts. Clustering shows covered entities how to anonymize data at the group level, which improves the balance between privacy and practicality. In the study’s words, the approach is to cluster data “based on QID [quasi-identifier] attributes, such as age and location, and then anonymize the QID values.”
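To make the idea of recursive binary partitioning concrete, here is a deliberately simplified sketch. It is not the study's algorithm: instead of measuring similarity of medical concepts, it splits records at the median of a single numeric attribute (a hypothetical `age` field) and stops when a split would leave fewer than `k` records on either side.

```python
def partition(records, key, k):
    """Recursively split records in two at the median of `key`.

    Stops when a split would leave either side with fewer than k records.
    """
    ordered = sorted(records, key=lambda r: r[key])
    mid = len(ordered) // 2
    left, right = ordered[:mid], ordered[mid:]
    if len(left) < k or len(right) < k:
        return [ordered]  # this group becomes one cluster
    return partition(left, key, k) + partition(right, key, k)


# Toy data with a hypothetical schema.
ages = [23, 25, 31, 34, 47, 52, 58, 61]
patients = [{"id": i, "age": age} for i, age in enumerate(ages)]

clusters = partition(patients, "age", k=2)
for cluster in clusters:
    print([p["age"] for p in cluster])
# [23, 25]
# [31, 34]
# [47, 52]
# [58, 61]
```

Every resulting cluster holds at least `k` records, so any transformation applied per cluster covers a group rather than a single patient.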

Cluster-level value enumeration

Once the data has been extracted and clustered, anonymization can be applied in a way that retains its analytical value. In the framework presented in the study, explicit identifiers are removed, and potentially identifying information, especially quasi-identifiers like age or hospital location, is transformed using a value-enumeration method. The aim is to limit the “risk of reidentifying an individual from the released data” while retaining meaningful data for analysis: values are replaced at the cluster level instead of being deleted outright.
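One plausible reading of cluster-level value enumeration, assumed here rather than taken from the study's implementation, is that each record's quasi-identifier is replaced by the set of values observed across its cluster. The field names and data below are hypothetical.

```python
def enumerate_cluster_values(cluster, quasi_identifiers):
    """Replace each quasi-identifier with the distinct values seen in the cluster.

    Non-quasi-identifier fields (such as diagnosis) are left untouched.
    """
    enumerated = {
        q: tuple(sorted({record[q] for record in cluster}))
        for q in quasi_identifiers
    }
    return [{**record, **enumerated} for record in cluster]


# A single toy cluster, assumed to come from an earlier clustering step.
cluster = [
    {"age": 31, "hospital": "North", "diagnosis": "asthma"},
    {"age": 34, "hospital": "East", "diagnosis": "flu"},
    {"age": 31, "hospital": "East", "diagnosis": "asthma"},
]

for row in enumerate_cluster_values(cluster, ["age", "hospital"]):
    print(row)
# {'age': (31, 34), 'hospital': ('East', 'North'), 'diagnosis': 'asthma'}
# {'age': (31, 34), 'hospital': ('East', 'North'), 'diagnosis': 'flu'}
# {'age': (31, 34), 'hospital': ('East', 'North'), 'diagnosis': 'asthma'}
```

After the transformation, every record in the cluster carries identical quasi-identifier values, so those attributes alone no longer distinguish one patient from another within the group.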

Hybrid and ensemble techniques

Hybrid approaches that combine rule-based detection with machine learning models help identify both expected and unexpected patterns of PHI in text. Pattern matching may flag known formats (like phone numbers), while machine learning models, such as support vector machines (SVM) or conditional random fields (CRF), learn from annotated data to recognize more subtle or ambiguous PHI instances.
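A hybrid pipeline can be sketched as several detectors whose spans are merged before redaction. In the sketch below, `stub_model_spans` is a hypothetical stand-in for a trained CRF or SVM tagger: it only looks up one hard-coded name so that the example runs without a model. The merging logic is the part being illustrated.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def rule_spans(text):
    """Rule-based detector: character spans of phone-number-like strings."""
    return [(m.start(), m.end()) for m in PHONE.finditer(text)]


def stub_model_spans(text):
    """Hypothetical stand-in for a learned tagger (CRF, SVM, etc.)."""
    spans = []
    for name in ("Maria Lopez",):  # a real model would predict these spans
        index = text.find(name)
        if index != -1:
            spans.append((index, index + len(name)))
    return spans


def merge_spans(spans):
    """Merge overlapping or touching spans into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def hybrid_redact(text, detectors):
    """Run every detector, merge their spans, and mask each one with [PHI]."""
    spans = merge_spans([span for detect in detectors for span in detect(text)])
    for start, end in reversed(spans):  # right to left keeps offsets valid
        text = text[:start] + "[PHI]" + text[end:]
    return text


note = "Maria Lopez can be reached at 555-867-5309 after discharge."
print(hybrid_redact(note, [rule_spans, stub_model_spans]))
# [PHI] can be reached at [PHI] after discharge.
```

Taking the union of detector outputs favors recall: a span is masked if any detector flags it, at the cost of some over-redaction.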

Surrogate replacement

Another de-identification technique is replacing real identifiers with realistic surrogate values. For instance, names may be swapped with realistic alternatives, and dates shifted, preserving the format and flow of the medical narrative without exposing actual identifying data. These substitutions help maintain context for downstream research use.
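As a rough sketch of surrogate replacement, the example below shifts all of a patient's dates by the same keyed offset, so intervals between events are preserved, and picks a replacement name from a small list. The secret key, the name list, and the `patient_id` values are all hypothetical; a real system would need proper key management and a much larger surrogate pool to avoid two patients sharing a surrogate.

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; never hard-code in practice
SURROGATE_NAMES = ["Alex Morgan", "Jordan Lee", "Sam Rivera", "Casey Kim"]


def _keyed_number(patient_id: str) -> int:
    """Derive a stable integer for a patient from a keyed hash (HMAC-SHA256)."""
    mac = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return int.from_bytes(mac.digest()[:8], "big")


def shift_date(patient_id: str, original: date, max_days: int = 365) -> date:
    """Shift a date by a per-patient offset in [-max_days, max_days].

    The offset is the same for every date of a given patient (and may be 0).
    """
    offset = _keyed_number(patient_id) % (2 * max_days + 1) - max_days
    return original + timedelta(days=offset)


def surrogate_name(patient_id: str) -> str:
    """Pick the same surrogate name every time for a given patient."""
    return SURROGATE_NAMES[_keyed_number(patient_id) % len(SURROGATE_NAMES)]


admitted = date(2021, 3, 14)
discharged = date(2021, 3, 20)
print(surrogate_name("patient-001"))
print(shift_date("patient-001", admitted), shift_date("patient-001", discharged))
```

Because both dates move by the same number of days, the six-day length of stay in this example is unchanged after shifting, which is what keeps the narrative usable for downstream research.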

By applying a blend of automated extraction, clustering, contextual classification, and value transformation, de-identification techniques transform raw clinical text into safer, research-ready datasets that minimize re-identification risk while preserving valuable clinical insights.

FAQs

What is protected health information (PHI)?

PHI refers to any information in a medical record or shared during a doctor-patient interaction that can be used to identify an individual. This includes names, addresses, birth dates, Social Security numbers, medical records, and more.

What does it mean to anonymize PHI?

Anonymizing PHI means removing or altering personal identifiers in the data so that individuals cannot be readily identified. This process is designed to protect patient privacy while still allowing the data to be used for research, analysis, and other purposes.

Can anonymized data be re-identified?

In theory, anonymized data can be re-identified if sufficient additional information is available or if the anonymization process was not thorough. However, proper anonymization techniques should minimize this risk.
