HIPAA restricts how covered entities can use and disclose protected health information (PHI). PHI includes any individually identifiable health information, from medical records and billing data to conversations between patients and providers.
However, once health information is properly de-identified, it is no longer considered PHI and falls outside HIPAA's regulations. According to 45 CFR § 164.514(a), "Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information." This creates a pathway for using healthcare data in AI development while maintaining patient privacy.
Learn more: When do you have to use deidentified data sets?
Two methods of data de-identification
HIPAA provides two methods for de-identifying data, each with its own requirements and use cases:
Expert determination method
The first is the Expert Determination method. Under this approach, a person with appropriate knowledge of and experience with statistical and scientific principles applies those methods to determine that the risk of re-identifying an individual is very small. As specified in 45 CFR § 164.514(b)(1), the expert must determine that "the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information." The expert must also document "the methods and results of the analysis that justify such determination." This method offers flexibility and can preserve data elements that would otherwise need to be removed, which is useful when those elements carry analytical value.
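To illustrate the statistical flavor of this approach, here is a minimal Python sketch of one metric experts commonly consider: the size of each record's equivalence class on a set of quasi-identifiers. The column names are hypothetical, and a real determination would involve far more rigorous, documented analysis by a qualified expert.

```python
import pandas as pd

def equivalence_class_risk(df: pd.DataFrame, quasi_ids: list[str]) -> dict:
    """Summarize re-identification risk from equivalence class sizes.

    Records sharing the same values on every quasi-identifier form an
    equivalence class of size k; a record's naive re-identification
    risk is 1/k, so unique records (k == 1) are the most exposed.
    """
    k = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    return {
        "records": len(df),
        "unique_records": int((k == 1).sum()),
        "max_risk": float((1 / k).max()),
        "mean_risk": float((1 / k).mean()),
    }

# Hypothetical quasi-identifier columns; the real set depends on what an
# anticipated recipient could plausibly link against.
sample = pd.DataFrame({
    "zip3": ["941", "941", "100"],
    "age_band": ["30-39", "30-39", "80-89"],
    "diagnosis": ["J45", "J45", "G30"],
})
print(equivalence_class_risk(sample, ["zip3", "age_band", "diagnosis"]))
# The third record is unique on all three columns, so its naive risk is 1.0.
```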
Safe harbor method
The second method is the Safe Harbor approach, which provides a checklist of 18 identifiers that must be removed from a dataset. These identifiers include names, geographic subdivisions smaller than a state, dates directly related to an individual (except year), telephone and fax numbers, email addresses, Social Security numbers, medical record numbers, account numbers, biometric identifiers, photographs, and any other unique identifying characteristics.
Additionally, under 45 CFR § 164.514(b)(2)(ii), "the covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information." This requirement makes clear that compliance is not just a matter of removing the listed identifiers; organizations must also consider whether the remaining data could be combined to re-identify individuals.
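As an illustration of what Safe Harbor redaction can look like for structured data, here is a minimal Python sketch. The field names are hypothetical, and it deliberately omits parts of the rule that require external verification, such as the census population test for retaining three-digit ZIP codes.

```python
from datetime import date

# Hypothetical field names for illustration; real schemas vary.
DIRECT_IDENTIFIERS = {
    "name", "street_address", "phone", "fax", "email", "ssn",
    "medical_record_number", "account_number", "photo_url",
}

def safe_harbor_redact(record: dict) -> dict:
    """Sketch of a Safe Harbor-style redaction for one patient record.

    Drops listed direct identifiers, keeps only the year of dates,
    truncates ZIP codes to their first three digits, and aggregates
    ages over 89 into a single '90+' category.
    """
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Dates directly related to the individual: keep the year only.
    for field in ("birth_date", "admission_date", "discharge_date"):
        if isinstance(out.get(field), date):
            out[field] = out[field].year

    # Geographic units smaller than a state must go; the first three
    # ZIP digits may be retained only if that area holds more than
    # 20,000 people -- that census check is not implemented here.
    if "zip" in out:
        out["zip"] = str(out["zip"])[:3]

    # Ages over 89 are aggregated into a single category.
    if isinstance(out.get("age"), int) and out["age"] > 89:
        out["age"] = "90+"

    return out
```

Even with all 18 identifier categories handled this way, the actual-knowledge standard above still applies to whatever data remains.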
Why de-identified data matters for AI training
Machine learning models require large and diverse datasets to learn patterns and make accurate predictions. Healthcare data is well suited to developing AI tools that can diagnose diseases, predict patient outcomes, personalize treatments, and optimize healthcare operations.
By properly de-identifying data, healthcare organizations can contribute to AI innovation without risking patient privacy. A hospital could, for example, use de-identified electronic health records to train an AI model that predicts sepsis risk, or a health insurer could develop algorithms to identify potential fraud using de-identified claims data.
However, organizations must also consider the risks that generative AI systems pose to data privacy. As noted in NIST's Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, generative artificial intelligence systems may "leak, generate, or correctly infer sensitive information about individuals," and models may reveal "sensitive information (from the public domain) that was included in training data." These risks underscore why thorough de-identification practices matter when training AI models on healthcare data.
Practical challenges
Modern datasets are large and sometimes include indirect identifiers that could potentially be linked to individuals through external data sources. For example, a dataset showing a rare disease diagnosis in a small geographic area might be re-identifiable even without obvious identifiers.
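The sketch below makes that linkage risk concrete: it joins a hypothetical "de-identified" extract against a hypothetical public dataset on shared quasi-identifiers, and any record matching exactly one external row is a candidate re-identification. All names, columns, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical "de-identified" clinical extract: no names or record
# numbers, but quasi-identifiers remain.
clinical = pd.DataFrame({
    "zip3": ["597", "100"],
    "birth_year": [1951, 1988],
    "sex": ["F", "M"],
    "diagnosis": ["rare_disease_x", "asthma"],
})

# Hypothetical external dataset (e.g., a public registry) that carries
# names alongside the same quasi-identifiers.
external = pd.DataFrame({
    "name": ["Jane Roe", "John Doe"],
    "zip3": ["597", "941"],
    "birth_year": [1951, 1988],
    "sex": ["F", "M"],
})

# A join on quasi-identifiers links a diagnosis to a name even though
# the clinical table contained no direct identifiers.
linked = clinical.merge(external, on=["zip3", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
# Output: Jane Roe is linked to rare_disease_x.
```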
The Safe Harbor method, while straightforward to apply, can also remove so much information that the data loses its value for training AI. Organizations must therefore balance privacy protection with data usability.
Re-identification risks also change as technology advances and more data becomes publicly available. Data that seemed de-identified five years ago might be vulnerable to re-identification today through advanced matching techniques or newly available datasets. This requires ongoing assessment and potential re-application of de-identification methods.
De-identification is even more difficult with unstructured data like clinical notes. As research on De-identification of free text data containing personal health information has found, no single approach could reliably de-identify all personal health identifying information in population data records. This shows that de-identification is not a one-size-fits-all process but requires layered approaches tailored to the specific characteristics of the data.
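For example, a regex-based scrubber like the sketch below catches only well-formatted identifiers; consistent with the research above, it would need to be layered with NLP-based entity recognition and human review rather than relied on alone.

```python
import re

# Patterns for well-formatted identifiers only; names, addresses, and
# oddly formatted values will slip through -- hence the layered approach.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def scrub_note(text: str) -> str:
    """Replace recognizable identifier patterns with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt Jane Roe seen 03/14/2024, cb 555-867-5309, SSN 123-45-6789."
print(scrub_note(note))
# "Pt Jane Roe seen [DATE], cb [PHONE], SSN [SSN]."
# The name "Jane Roe" survives -- regex alone is not reliable.
```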
Best practices for healthcare organizations
Organizations seeking to use de-identified data for AI training should adopt the following best practices for privacy protection:
- Establish governance policies that define when and how de-identification will be used. These policies should specify who has authority to de-identify data, what methods will be employed, and how de-identified data will be managed throughout its lifecycle. As NIST's Generative AI Profile recommends, organizations should "establish transparency policies and processes for documenting the origin and history of training data."
- Invest in expertise to ensure technical and regulatory compliance. Whether pursuing Expert Determination or Safe Harbor, organizations need individuals who understand the requirements and regulations.
- Implement technical safeguards beyond the minimum requirements to provide additional layers of protection (see the pseudonymization sketch after this list).
- Maintain documentation of all de-identification processes. This documentation helps demonstrate compliance with HIPAA requirements, supports quality assurance, and provides a basis for assessing whether data remains properly de-identified as circumstances change.
- Consider the ethical dimensions, not just legal compliance, when working with patient data. Even when data is legally de-identified, organizations should think about patient expectations, potential harms from algorithmic bias, and transparency about how patient data contributes to AI development. The NIST framework notes that generative artificial intelligence can "increase the speed and scale at which harmful biases manifest."
- Conduct risk assessments when using generative AI systems. Organizations should engage in activities such as red-teaming, which NIST describes as helping to "identify potential adverse behavior or outcomes of a GAI model," and establish "procedures for the remediation of issues which trigger incident response processes."
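As one example of a technical safeguard layered on top of identifier removal, records can be pseudonymized with a keyed hash so they remain linkable across extracts without retaining the original identifier. A minimal sketch, assuming a secret key stored outside the dataset:

```python
import hmac
import hashlib

def pseudonymize(identifier: str, key: bytes) -> str:
    """Derive a stable pseudonym from an identifier with a keyed HMAC.

    The same identifier and key always yield the same pseudonym, so
    longitudinal records stay linkable, but without the key the
    pseudonym cannot feasibly be reversed or recomputed.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# The key must live in a secrets manager, never alongside the data;
# "mrn-000123" is a hypothetical medical record number.
key = b"load-me-from-a-secrets-manager"
print(pseudonymize("mrn-000123", key))
```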
FAQs
Can de-identified data ever become PHI again?
Yes. If de-identified data is later combined with other information in a way that enables re-identification, it becomes individually identifiable again and falls back under HIPAA's rules.
Does HIPAA require patient consent to use de-identified data for AI training?
No, HIPAA does not require patient authorization once data is properly de-identified.
Are business associates allowed to de-identify data on behalf of covered entities?
Yes, business associates may de-identify data if their agreements permit it and HIPAA requirements are followed.
How does HIPAA de-identification interact with state privacy laws like CPRA or GDPR?
State and international privacy laws may impose stricter standards than HIPAA, even for de-identified data.
Can de-identified data be shared or sold to third-party AI vendors?
HIPAA allows sharing of properly de-identified data, but contractual, ethical, and reputational risks still apply.
