
Anonymizing protected health information

Protected health information (PHI) is anonymized when patient data needs to be shared, analyzed, or used without compromising individual privacy, for example in research, public health reporting, regulatory compliance, and data sharing with third parties. Anonymization lets organizations use healthcare data to improve patient care, advance medical research, and develop new healthcare products while keeping patient identities protected.

 

HIPAA compliance and anonymizing PHI

To make health data usable for research, analytics, or public health while protecting privacy, HIPAA allows certain PHI to be transformed so it is no longer considered identifiable. According to HHS guidance, de-identified data is information that does not identify an individual and for which there is no reasonable basis to believe that it can be used to identify an individual.

The rationale behind HIPAA’s de-identification framework is to strike a balance between privacy protection and data use. As HHS explains, the goal is to protect individual privacy while allowing important uses of health information. By requiring identifiers to be removed or re-identification risks to be minimized, HIPAA enables the responsible secondary use of health information without compromising individual privacy.

Once PHI has been de-identified under HIPAA’s standards, it is no longer subject to the Privacy Rule’s protections, meaning it can be used and disclosed without the restrictions that apply to identifiable information.

 

Two approaches to de-identifying PHI

When it comes to anonymizing PHI, there are two primary methods recognized under HIPAA:

Safe harbor method

In the Safe Harbor method, "the following identifiers of the individual or of relatives, employers, or household members of the individual, are removed." These include:

  • Names
  • All geographic subdivisions smaller than a state (e.g., street addresses, city, county, ZIP code)
  • All elements of dates (except year) related to an individual (e.g., birthdate, admission date), and all ages over 89
  • Telephone numbers
  • Fax numbers
  • Email addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Certificate/license numbers
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Device identifiers and serial numbers
  • URLs
  • IP addresses
  • Biometric identifiers (e.g., fingerprints)
  • Full-face photographs and any comparable images
  • Any other unique identifying number, characteristic, or code

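Once these identifiers are removed, and provided the covered entity has no actual knowledge that the remaining information could be used to identify an individual, the data is considered de-identified under HIPAA, meaning it is no longer subject to the regulations that apply to PHI.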
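As a rough illustration of the removal step for structured records, here is a minimal Python sketch. The field names (`name`, `ssn`, `birth_date`, and so on) are hypothetical stand-ins for the identifier categories above, not a standard schema, and the sketch does not handle free-text fields or every Safe Harbor condition.

```python
from datetime import date

# Hypothetical field names mapped to the identifier categories listed above.
# A real dataset would need its own mapping.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "street_address", "city", "county", "phone", "fax", "email",
    "ssn", "medical_record_number", "health_plan_id", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
}
DATE_FIELDS = {"birth_date", "admission_date"}


def safe_harbor_strip(record: dict) -> dict:
    """Drop identifier fields and reduce dates to the year only."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue  # remove the identifier entirely
        if field in DATE_FIELDS:
            if isinstance(value, date):
                cleaned[field.replace("_date", "_year")] = value.year
            continue  # dates in any other form are dropped, not kept
        cleaned[field] = value
    return cleaned


record = {
    "name": "Jane Example",
    "city": "Springfield",
    "state": "IL",
    "birth_date": date(1980, 5, 17),
    "diagnosis": "asthma",
}
print(safe_harbor_strip(record))
# {'state': 'IL', 'birth_year': 1980, 'diagnosis': 'asthma'}
```

Dropping unparsed dates instead of passing them through is a deliberately conservative choice: a value the code cannot interpret is treated as an identifier.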

 

Expert determination method

Under the expert determination method, a covered entity may de-identify PHI by relying on a qualified expert with “appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods.” According to HIPAA, this expert must apply those principles to determine that the “risk is very small” that the information could be used, either alone or in combination with other reasonably available data, to identify an individual.

Unlike the Safe Harbor method, this approach does not require the automatic removal of a fixed list of identifiers. Instead, it allows certain data elements to be retained if the expert determines they do not meaningfully increase the risk of re-identification. Moreover, HIPAA requires that the expert document both the methods used and the results of the analysis to justify the determination that the data is not individually identifiable.

This flexibility makes the expert determination method particularly valuable for research, analytics, and public health use cases where preserving data granularity is essential, provided that re-identification risks remain sufficiently low.
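One simple signal that is sometimes used when reasoning about re-identification risk is how many records share the same combination of indirect attributes. The Python sketch below counts the smallest such group. It is only an illustration: HIPAA does not prescribe this metric, the field names are hypothetical, and an actual expert determination would rest on a much fuller analysis.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the rarest quasi-identifier combination."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())


records = [
    {"age_band": "30-39", "state": "CA", "diagnosis": "asthma"},
    {"age_band": "30-39", "state": "CA", "diagnosis": "diabetes"},
    {"age_band": "40-49", "state": "CA", "diagnosis": "asthma"},
]
k = smallest_group_size(records, ["age_band", "state"])
print(k)  # 1: the single 40-49/CA record is unique on these attributes
```

A group of size 1 means at least one person is unique on the chosen attributes, which is the kind of finding that would prompt further generalization before release.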

 

Techniques for anonymizing PHI

To effectively anonymize PHI, a variety of data masking techniques can be used. The study Anonymizing and Sharing Medical Text Records identifies some of these techniques:

Automatic information extraction

The study describes how modern systems focus on extracting identifiers such as names and dates, noting that “information extraction… extracts PHI and non-PHI attributes that could reveal patients’ identities, as well as health and medical information, from the text documents.”

This extraction is done through techniques such as:

  • Pattern matching, which uses rules or dictionaries to find common PHI elements, and
  • Machine learning classifiers, which can learn patterns to identify PHI even in varied and complex text.
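The pattern-matching approach can be sketched with a few regular expressions in Python. The patterns below are simplified, US-centric assumptions chosen for illustration and would miss many real-world formats.

```python
import re

# Simplified patterns, assumed for illustration only.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def find_phi(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PHI_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])


def redact(text):
    """Replace each match with a bracketed label, working right to left
    so earlier offsets stay valid."""
    for label, start, end, _ in reversed(find_phi(text)):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


note = "Seen on 03/14/2021. Call 555-867-5309 or email pat@example.org."
print(redact(note))
# Seen on [DATE]. Call [PHONE] or email [EMAIL].
```

Rules like these are precise for fixed formats but cannot recognize names or free-form descriptions, which is where learned classifiers come in.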

 

Clustering related records

In the study, the authors describe a method where documents are organized using recursive binary partitioning, a technique that clusters records by similarity of medical concepts. Clustering shows covered entities how to anonymize data at a group level, improving the balance between privacy and practicality. The authors cluster the data “based on QID attributes, such as age and location, and then anonymize the QID values.” Here, QID stands for quasi-identifier.
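To make the idea of recursive binary partitioning concrete, the Python sketch below repeatedly halves a set of records until a further split would leave a group smaller than a chosen minimum. It is a simplification and not the study's algorithm: it splits on a single numeric attribute (a hypothetical `age` field), whereas the study clusters on medical concepts.

```python
def partition(records, key, min_size):
    """Recursively halve records ordered by `key`, stopping when a split
    would leave a group with fewer than `min_size` members."""
    ordered = sorted(records, key=lambda r: r[key])
    mid = len(ordered) // 2
    # The left half is never larger than the right, so checking it suffices.
    if mid < min_size:
        return [ordered]
    return (partition(ordered[:mid], key, min_size)
            + partition(ordered[mid:], key, min_size))


patients = [{"age": a} for a in (47, 23, 61, 31, 52, 25, 58, 34)]
clusters = partition(patients, "age", min_size=2)
print([[r["age"] for r in c] for c in clusters])
# [[23, 25], [31, 34], [47, 52], [58, 61]]
```

Each resulting cluster holds at least `min_size` records with similar values, which is what allows the later anonymization step to work at the group level.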

 

Cluster-level value enumeration

Once the data has been extracted and clustered, anonymization can be applied in a way that retains its analytical value. In the framework presented in the study, explicit identifiers are removed, and potentially identifying information, especially quasi-identifiers like age or hospital location, is transformed using a value-enumeration method. The aim is to limit the risk of reidentifying an individual from the released data. By replacing values at the cluster level instead of deleting them outright, the method reduces re-identification risk while retaining meaningful data for analysis.
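One plausible reading of cluster-level value enumeration is that each record's quasi-identifier value is replaced by the full set of values seen in its cluster. The Python sketch below follows that reading; it is an assumption made for illustration, not the study's exact procedure, and the field names are hypothetical.

```python
def enumerate_values(cluster, quasi_identifiers):
    """Replace each quasi-identifier with the sorted set of values
    that occur anywhere in the cluster."""
    pools = {q: sorted({r[q] for r in cluster}) for q in quasi_identifiers}
    released = []
    for r in cluster:
        out = {k: v for k, v in r.items() if k not in quasi_identifiers}
        out.update({q: list(pools[q]) for q in quasi_identifiers})
        released.append(out)
    return released


cluster = [
    {"age": 23, "hospital": "North", "diagnosis": "asthma"},
    {"age": 25, "hospital": "East", "diagnosis": "eczema"},
]
released = enumerate_values(cluster, ["age", "hospital"])
for row in released:
    print(row)
# {'diagnosis': 'asthma', 'age': [23, 25], 'hospital': ['East', 'North']}
# {'diagnosis': 'eczema', 'age': [23, 25], 'hospital': ['East', 'North']}
```

Every record in the cluster now carries identical quasi-identifier values, so those attributes no longer distinguish one patient from another, while an analyst can still see which ages and locations the cluster covers.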

 

Hybrid and ensemble techniques

Hybrid approaches that combine rule-based detection with machine learning models help identify both expected and unexpected patterns of PHI in text. Pattern matching may flag known formats (like phone numbers), while machine learning models, such as support vector machines (SVM) or conditional random fields (CRF), learn from annotated data to recognize more subtle or ambiguous PHI instances.
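A hybrid pipeline ultimately has to combine what its detectors find. The Python sketch below merges character spans from a rule-based detector and a second detector, then masks the union. The `model_spans` function is a hypothetical stand-in: a real system would call a trained CRF or SVM tagger there, and the title-based heuristic is only a placeholder.

```python
import re


def rule_based_spans(text):
    """Spans for a known format: phone numbers such as 555-867-5309."""
    return [(m.start(), m.end())
            for m in re.finditer(r"\b\d{3}-\d{3}-\d{4}\b", text)]


def model_spans(text):
    """Placeholder for a trained tagger (hypothetical): flags a capitalized
    word that follows a title such as 'Dr.' or 'Ms.'."""
    pattern = r"\b(?:Dr|Mr|Ms|Mrs)\.\s+([A-Z][a-z]+)"
    return [(m.start(1), m.end(1)) for m in re.finditer(pattern, text)]


def merge_spans(*span_lists):
    """Union the detectors' spans, merging any that overlap or touch."""
    merged = []
    for start, end in sorted(s for spans in span_lists for s in spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask(text, spans):
    """Replace each span with a generic tag, working right to left."""
    for start, end in reversed(spans):
        text = text[:start] + "[PHI]" + text[end:]
    return text


note = "Dr. Alvarez can be reached at 555-867-5309."
spans = merge_spans(rule_based_spans(note), model_spans(note))
print(mask(note, spans))
# Dr. [PHI] can be reached at [PHI].
```

Taking the union of spans favors recall: anything either detector flags is masked, which suits de-identification, where a missed identifier is usually costlier than an over-redaction.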

 

Surrogate replacement

Another de-identification technique is replacing real identifiers with realistic surrogate values. For instance, names may be swapped with realistic alternatives, and dates shifted, preserving the format and flow of the medical narrative without exposing actual identifying data. These substitutions help maintain context for downstream research use.
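The Python sketch below shows one way surrogate replacement might look: names are mapped to entries in a made-up surrogate pool, and dates are shifted by a per-patient offset derived from a secret key. The pool, the key handling, and the 30-day window are all assumptions for illustration.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Made-up surrogate pool; a real system would use a much larger one.
SURROGATE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Kim"]


def _stable_int(secret, value):
    """Derive a repeatable integer from a secret key and a value."""
    digest = hmac.new(secret.encode(), value.encode(), hashlib.sha256)
    return int(digest.hexdigest(), 16)


def surrogate_name(secret, real_name):
    """Map a real name to the same surrogate every time it appears."""
    return SURROGATE_NAMES[_stable_int(secret, real_name) % len(SURROGATE_NAMES)]


def shift_date(secret, patient_id, d, max_days=30):
    """Shift a date by a per-patient offset in [-max_days, max_days]."""
    offset = _stable_int(secret, patient_id) % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)


secret = "demo-secret"  # placeholder; keep a real key out of source code
admitted = shift_date(secret, "patient-1", date(2021, 3, 1))
discharged = shift_date(secret, "patient-1", date(2021, 3, 11))
print(surrogate_name(secret, "Jane Example"))
print((discharged - admitted).days)  # 10: the length of stay is preserved
```

Because every date for a given patient moves by the same offset, intervals such as length of stay survive the shift. With a pool this small, different real names can collide on one surrogate, and the offset can occasionally be zero, so both parameters would need tuning in practice.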

By applying a blend of automated extraction, clustering, contextual classification, and value transformation, de-identification techniques transform raw clinical text into safer, research-ready datasets that minimize re-identification risk while preserving valuable clinical insights.

 

FAQs

What is protected health information (PHI)?

PHI refers to any information in a medical record or shared during a doctor-patient interaction that can be used to identify an individual. This includes names, addresses, birth dates, Social Security numbers, medical records, and more.

 

What does it mean to anonymize PHI?

Anonymizing PHI means removing or altering personal identifiers in the data so that individuals cannot be readily identified. This process is designed to protect patient privacy while still allowing the data to be used for research, analysis, and other purposes.

 

Can anonymized data be re-identified?

In theory, anonymized data can be re-identified if sufficient additional information is available or if the anonymization process was not thorough. However, proper anonymization techniques should minimize this risk.
